CN104537118A

CN104537118A - Microblog data processing method, device and system

Info

Publication number: CN104537118A
Application number: CN201510036778.2A
Authority: CN
Inventors: 李寿山; 王晶晶; 段湘煜; 周国栋
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2015-01-26
Filing date: 2015-01-26
Publication date: 2015-04-22
Anticipated expiration: 2035-01-26
Also published as: CN104537118B

Abstract

The application provides a microblog data processing method, device and system. The method comprises: using a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as a positive sample or as a negative sample; obtaining a to-be-classified positive prediction conditional probability and a to-be-classified negative prediction conditional probability; determining that the sample under test is a positive sample when the comparison shows that the to-be-classified positive prediction conditional probability is larger, and that it is a negative sample when the to-be-classified negative prediction conditional probability is larger, thereby completing the prediction of the category of the sample under test; and determining that the two accounts corresponding to the sample under test belong to the same user when the sample is predicted to be positive, and that they do not belong to the same user when the sample is predicted to be negative, thereby identifying the same user across different microblog sites.

Description

Microblog data processing method, device and system

Technical field

The application relates to the fields of natural language processing and social networks, and in particular to a microblog data processing method, device and system.

Background technology

In recent years, with the rapid development of social networks, microblogs (micro-blogs) have become very popular with users. Sina Weibo and Tencent Weibo are well-known microblog sites in China: by December 2012, Sina Weibo had more than 503 million registered users and Tencent Weibo had reached 540 million, and microblog users were publishing more than 200 million posts per day. Because microblogs have the characteristics of both a broadcast medium and a social network, they have attracted many researchers to analyze microblog data.

In such research, identifying the same user across different microblog sites is important. Being able to do so helps enterprises target advertising accurately, and it supports research into users' motivations for using different social networks and the correlations between them, which in turn helps social network operators develop better products.

However, there is currently no effective method for identifying the same user across different microblog sites.

Summary of the invention

To solve the above technical problem, embodiments of the present application provide a microblog data processing method, device and system for identifying the same user across different microblog sites. The technical solution is as follows:

A microblog data processing method, comprising:

Performing feature extraction on a sample under test to obtain a to-be-tested feature extraction result value, wherein the sample under test is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs is different from the microblog site to which the account corresponding to the second microblog account information belongs;

Determining each numerical value contained in the to-be-tested feature extraction result value as a feature value to be classified;

Using a maximum entropy classifier, calculating, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class respectively;

Multiplying together the to-be-classified positive prediction sub-conditional probabilities of all feature values to be classified to obtain a to-be-classified positive prediction conditional probability, and multiplying together the to-be-classified negative prediction sub-conditional probabilities of all feature values to be classified to obtain a to-be-classified negative prediction conditional probability;

Comparing the to-be-classified positive prediction conditional probability with the to-be-classified negative prediction conditional probability;

When the comparison shows that the to-be-classified positive prediction conditional probability is larger, determining that the category of the sample under test is positive;

When the comparison shows that the to-be-classified negative prediction conditional probability is larger, determining that the category of the sample under test is negative;

When the category of the sample under test is positive, determining that the two accounts corresponding to the sample under test belong to the same user;

When the category of the sample under test is negative, determining that the two accounts corresponding to the sample under test do not belong to the same user.
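
As an illustration only, the final steps of the method (multiplying the sub-conditional probabilities and comparing the two products) can be sketched in Python as follows. The function name classify_pair and the handling of ties are assumptions made for this sketch; the application does not state what happens when the two products are equal.

```python
import math
from typing import Sequence


def classify_pair(pos_sub_probs: Sequence[float],
                  neg_sub_probs: Sequence[float]) -> int:
    """Return +1 (same user) or -1 (different users) for one sample under test.

    pos_sub_probs / neg_sub_probs hold one sub-conditional probability per
    feature value to be classified, for the positive and negative presets.
    """
    # Multiply the sub-conditional probabilities into the two conditional
    # probabilities described in the method.
    pos = math.prod(pos_sub_probs)
    neg = math.prod(neg_sub_probs)
    # Assumption: a tie is treated as negative (not specified in the source).
    return 1 if pos > neg else -1
```

With only six feature values per sample the plain product is numerically safe; for much longer feature vectors, comparing sums of logarithms would be the usual alternative.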

Preferably, the process of using the maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class comprises:

Using the maximum entropy objective function formula

P_{\lambda}(y|x) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}

to calculate, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when y is +1 and -1 respectively, wherein y is the sample under test, x is a feature value to be classified, P_λ(y|x) is the to-be-classified prediction sub-conditional probability, exp() is the exponential function with the natural constant e as its base, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

λ_i is the weight of the feature function value f_i(x, y), namely the positive optimal weight when y is +1 or the negative optimal weight when y is -1, and the different feature function values corresponding to the same x share the same weight; Σ_i is the function that sums over the feature function values corresponding to each feature value to be classified; and Σ_y is the function that sums over the data corresponding to the different values of y;

Wherein y = +1 indicates that the sample under test is preset as the positive class, and y = -1 indicates that the sample under test is preset as the negative class. The feature function values corresponding to each feature value to be classified correspond respectively to the positive and negative preset categories of the sample under test. When calculating the to-be-classified positive prediction sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the positive optimal weight corresponding to that feature value; otherwise λ is 0. When calculating the to-be-classified negative prediction sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the negative optimal weight corresponding to that feature value; otherwise λ is 0.
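
The following is a minimal sketch of how the two sub-conditional probabilities for a single feature value might be computed. It assumes that exactly one binary feature function fires for each combination of feature value and preset category, so that the maximum entropy formula reduces to a two-way normalization over the positive and negative weights. The dictionaries pos_weights and neg_weights, and the use of a (position, value) tuple as the key for a feature value, are hypothetical choices for this illustration and are not taken from the application.

```python
import math
from typing import Dict, Hashable, Tuple


def sub_conditional_probs(x: Hashable,
                          pos_weights: Dict[Hashable, float],
                          neg_weights: Dict[Hashable, float]
                          ) -> Tuple[float, float]:
    """Return (P(y=+1 | x), P(y=-1 | x)) for one feature value x."""
    # A feature value that is not among the preset (trained) feature values
    # gets weight 0, as stated in the text.
    lam_pos = pos_weights.get(x, 0.0)
    lam_neg = neg_weights.get(x, 0.0)
    # Normalizer: sum over both values of y.
    z = math.exp(lam_pos) + math.exp(lam_neg)
    return math.exp(lam_pos) / z, math.exp(lam_neg) / z
```

Under this reading, a feature value never seen in training contributes 0.5 to both products and therefore does not influence the comparison.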

Preferably, the first microblog account information and the second microblog account information each comprise at least:

A user identity number (ID), nickname, gender, age, location, and the ratio of followers to followed users.

Preferably, the process of performing feature extraction on the sample under test to obtain the to-be-tested feature extraction result value comprises:

Judging whether the user ID in the first microblog account information is identical to the user ID in the second microblog account information; if so, representing the first sub-value of the to-be-tested feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nickname in the first microblog account information is identical to the nickname in the second microblog account information; if so, representing the second sub-value of the to-be-tested feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the gender in the first microblog account information is identical to the gender in the second microblog account information; if so, representing the third sub-value of the to-be-tested feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the age in the first microblog account information with the age in the second microblog account information: if neither age is filled in, representing the fourth sub-value of the to-be-tested feature extraction result with the value 0; if the age is filled in for only one of the two pieces of microblog account information, representing it with the value 1; if the two ages are identical, representing it with the value 2; if the two ages are different, representing it with the value 3;

Judging whether the location in the first microblog account information is identical to the location in the second microblog account information; if so, representing the fifth sub-value of the to-be-tested feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratio of followers to followed users in the first microblog account information falls within the same ratio range as the ratio of followers to followed users in the second microblog account information; if so, representing the sixth sub-value of the to-be-tested feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.
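
The six comparisons above can be sketched in Python as follows. The Account dataclass, its field names, and in particular the RATIO_BOUNDS bucket boundaries are assumptions introduced for this illustration; the text only requires that the two ratios fall within "the same ratio range" and does not give the ranges here.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical bucket boundaries for the follower/followed ratio.
RATIO_BOUNDS = (0.5, 1.0, 2.0, 10.0)


@dataclass
class Account:
    user_id: str
    nickname: str
    gender: str
    age: Optional[int]   # None means the age was not filled in
    location: str
    fan_ratio: float     # followers divided by followed users


def age_feature(a: Optional[int], b: Optional[int]) -> int:
    """Fourth sub-value: 0 both missing, 1 one missing, 2 equal, 3 different."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1
    return 2 if a == b else 3


def extract_features(a: Account, b: Account) -> List[int]:
    """Build the six-element feature extraction result for one account pair."""
    same_bucket = (bisect.bisect_right(RATIO_BOUNDS, a.fan_ratio)
                   == bisect.bisect_right(RATIO_BOUNDS, b.fan_ratio))
    return [
        int(a.user_id == b.user_id),
        int(a.nickname == b.nickname),
        int(a.gender == b.gender),
        age_feature(a.age, b.age),
        int(a.location == b.location),
        int(same_bucket),
    ]
```

The same function can serve for the sample under test and for the training samples, since both use identical comparison rules.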

Preferably, the training process of the maximum entropy classifier comprises:

Obtaining a plurality of different positive class samples and a plurality of different negative class samples, wherein each positive class sample comprises two pieces of positive account information, which are the account information of the same user on two different microblog sites; each negative class sample comprises two pieces of negative account information, which belong to different users and whose corresponding accounts belong to different microblog sites; the two microblog sites corresponding to the positive class samples are the same as the two microblog sites corresponding to the negative class samples; and the two microblog sites corresponding to the sample under test are the same as the two microblog sites corresponding to the positive class samples;

Performing feature extraction on each positive class sample and each negative class sample respectively to obtain the corresponding positive training samples and negative training samples;

Determining the numerical values contained in each positive training sample and each negative training sample as feature values;

According to the formula

P_{\lambda}(y|x) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}

calculating, for each feature value, the corresponding positive prediction conditional probability and negative prediction conditional probability when each y is +1 and -1 respectively;

Wherein y is any positive training sample or any negative training sample, x is a feature value, P_λ(y|x) is the prediction conditional probability, exp() is the exponential function with the natural constant e as its base, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

λ_i is the weight of the feature function value f_i(x, y), and the different feature function values corresponding to the same x share the same weight; Σ_i is the function that sums over the feature function values corresponding to each feature value; Σ_y is the function that sums over the data corresponding to the different values of y; and the initial value of λ_i is known;

Using the GIS (Generalized Iterative Scaling) algorithm, adjusting the positive prediction conditional probability corresponding to each feature value until the positive prediction conditional probability of each feature value converges, and taking the λ corresponding to the converged positive prediction conditional probability of each feature value as the positive optimal weight of the feature function value corresponding to that feature value;

Using the GIS algorithm, adjusting the negative prediction conditional probability corresponding to each feature value until the negative prediction conditional probability of each feature value converges, and taking the λ corresponding to the converged negative prediction conditional probability of each feature value as the negative optimal weight of the feature function value corresponding to that feature value.
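
For orientation, the following is a minimal sketch of textbook Generalized Iterative Scaling for a two-class conditional maximum entropy model. It is not a reproduction of the procedure claimed above: it updates positive and negative weights in the same pass, runs a fixed number of iterations instead of testing convergence, starts all weights at 0, and omits the GIS correction feature for brevity. The (position, value, label) feature key and the names train_gis and predict are assumptions made for this sketch.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Feature = Tuple[int, int, int]  # (position, value, label), hypothetical key
LABELS = (1, -1)


def predict(weights: Dict[Feature, float], x: Sequence[int]) -> Dict[int, float]:
    """Return P(y | x) for y in {+1, -1} under the current weights."""
    scores = {
        y: math.exp(sum(weights.get((j, v, y), 0.0) for j, v in enumerate(x)))
        for y in LABELS
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


def train_gis(samples: List[Sequence[int]], labels: List[int],
              iterations: int = 50) -> Dict[Feature, float]:
    """Estimate weights with GIS; features unseen in training keep weight 0."""
    c = len(samples[0])  # upper bound on active features per (x, y)
    # Empirical feature counts from the labelled training samples.
    empirical: Dict[Feature, float] = defaultdict(float)
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            empirical[(j, v, y)] += 1.0
    weights = {f: 0.0 for f in empirical}  # assumed known initial value: 0
    for _ in range(iterations):
        # Expected feature counts under the current model.
        expected: Dict[Feature, float] = defaultdict(float)
        for x in samples:
            p = predict(weights, x)
            for j, v in enumerate(x):
                for y in LABELS:
                    expected[(j, v, y)] += p[y]
        # GIS update: move each weight toward matching the empirical count.
        for f in weights:
            weights[f] += math.log(empirical[f] / expected[f]) / c
    return weights
```

Weights whose label component is +1 play the role of the positive optimal weights and those with -1 the negative optimal weights.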

Preferably, when the positive account information comprises at least a user ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information comprises at least a user ID, nickname, gender, age, location, and the ratio of followers to followed users, the process of performing feature extraction on each positive class sample and each negative class sample respectively to obtain the corresponding positive training samples and negative training samples comprises:

Judging whether the user IDs in the two pieces of positive account information of each positive class sample are identical; if so, representing the first sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nicknames in the two pieces of positive account information of each positive class sample are identical; if so, representing the second sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the genders in the two pieces of positive account information of each positive class sample are identical; if so, representing the third sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the ages in the two pieces of positive account information of each positive class sample: if neither age is filled in, representing the fourth sub-value of the positive feature extraction result with the value 0; if the age is filled in for only one of the two pieces of positive account information, representing it with the value 1; if the two ages are identical, representing it with the value 2; if the two ages are different, representing it with the value 3;

Judging whether the locations in the two pieces of positive account information of each positive class sample are identical; if so, representing the fifth sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratios of followers to followed users in the two pieces of positive account information of each positive class sample fall within the same ratio range; if so, representing the sixth sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the positive feature extraction result corresponding to each positive class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive class sample;

Judging whether the user IDs in the two pieces of negative account information of each negative class sample are identical; if so, representing the first sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nicknames in the two pieces of negative account information of each negative class sample are identical; if so, representing the second sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the genders in the two pieces of negative account information of each negative class sample are identical; if so, representing the third sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the ages in the two pieces of negative account information of each negative class sample: if neither age is filled in, representing the fourth sub-value of the negative feature extraction result with the value 0; if the age is filled in for only one of the two pieces of negative account information, representing it with the value 1; if the two ages are identical, representing it with the value 2; if the two ages are different, representing it with the value 3;

Judging whether the locations in the two pieces of negative account information of each negative class sample are identical; if so, representing the fifth sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratios of followers to followed users in the two pieces of negative account information of each negative class sample fall within the same ratio range; if so, representing the sixth sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the negative feature extraction result corresponding to each negative class sample into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative class sample.
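
As a small illustration of how the positive and negative class samples might be turned into labelled training data, the sketch below applies a caller-supplied extraction function to every account pair and attaches the label +1 or -1. The function name build_training_set and the extract parameter are hypothetical; any function that maps an account pair to the six sub-values described above could be passed in.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

A = TypeVar("A")  # whatever type represents one piece of account information


def build_training_set(
    positive_pairs: Iterable[Tuple[A, A]],
    negative_pairs: Iterable[Tuple[A, A]],
    extract: Callable[[A, A], Sequence[int]],
) -> Tuple[List[List[int]], List[int]]:
    """Return (samples, labels): one feature vector and one label per pair."""
    samples: List[List[int]] = []
    labels: List[int] = []
    # Positive class samples: two accounts of the same user on two sites.
    for a, b in positive_pairs:
        samples.append(list(extract(a, b)))
        labels.append(1)
    # Negative class samples: accounts of different users on the two sites.
    for a, b in negative_pairs:
        samples.append(list(extract(a, b)))
        labels.append(-1)
    return samples, labels
```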

A microblog data processing device, comprising:

A first feature extraction unit, configured to perform feature extraction on a sample under test to obtain a to-be-tested feature extraction result value, wherein the sample under test is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs is different from the microblog site to which the account corresponding to the second microblog account information belongs;

A first determining unit, configured to determine each numerical value contained in the to-be-tested feature extraction result value as a feature value to be classified;

A first computing unit, configured to use a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class respectively;

A second computing unit, configured to multiply together the to-be-classified positive prediction sub-conditional probabilities of all feature values to be classified to obtain a to-be-classified positive prediction conditional probability, and to multiply together the to-be-classified negative prediction sub-conditional probabilities of all feature values to be classified to obtain a to-be-classified negative prediction conditional probability;

A comparing unit, configured to compare the to-be-classified positive prediction conditional probability with the to-be-classified negative prediction conditional probability; when the comparison shows that the to-be-classified positive prediction conditional probability is larger, to trigger a second determining unit to determine that the category of the sample under test is positive; and when the comparison shows that the to-be-classified negative prediction conditional probability is larger, to trigger a third determining unit to determine that the category of the sample under test is negative;

A fourth determining unit, configured to determine, when the category of the sample under test is positive, that the two accounts corresponding to the sample under test belong to the same user;

A fifth determining unit, configured to determine, when the category of the sample under test is negative, that the two accounts corresponding to the sample under test do not belong to the same user.

Preferably, the first computing unit comprises:

A computing subunit, configured to use the maximum entropy objective function formula

P_{\lambda}(y|x) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}

to calculate, for each feature value to be classified, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when y is +1 and -1 respectively, wherein y is the sample under test, x is a feature value to be classified, P_λ(y|x) is the to-be-classified prediction sub-conditional probability, exp() is the exponential function with the natural constant e as its base, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

Preferably, when the first microblog account information and the second microblog account information each comprise at least a user identity number (ID), nickname, gender, age, location, and the ratio of followers to followed users, the first feature extraction unit comprises:

A first judging subunit, configured to judge whether the user ID in the first microblog account information is identical to the user ID in the second microblog account information; if so, to represent the first sub-value of the to-be-tested feature extraction result with the value 1; if not, to represent it with the value 0;

A second judging subunit, configured to judge whether the nickname in the first microblog account information is identical to the nickname in the second microblog account information; if so, to represent the second sub-value of the to-be-tested feature extraction result with the value 1; if not, to represent it with the value 0;

A third judging subunit, configured to judge whether the gender in the first microblog account information is identical to the gender in the second microblog account information; if so, to represent the third sub-value of the to-be-tested feature extraction result with the value 1; if not, to represent it with the value 0;

A first comparing subunit, configured to compare the age in the first microblog account information with the age in the second microblog account information: if neither age is filled in, to represent the fourth sub-value of the to-be-tested feature extraction result with the value 0; if the age is filled in for only one of the two pieces of microblog account information, to represent it with the value 1; if the two ages are identical, to represent it with the value 2; if the two ages are different, to represent it with the value 3;

A fourth judging subunit, configured to judge whether the location in the first microblog account information is identical to the location in the second microblog account information; if so, to represent the fifth sub-value of the to-be-tested feature extraction result with the value 1; if not, to represent it with the value 0;

A fifth judging subunit, configured to judge whether the ratio of followers to followed users in the first microblog account information falls within the same ratio range as the ratio of followers to followed users in the second microblog account information; if so, to represent the sixth sub-value of the to-be-tested feature extraction result with the value 1; if not, to represent it with the value 0;

A first composing subunit, configured to compose the first, second, third, fourth, fifth and sixth sub-values of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.

A microblog data processing system, comprising a maximum entropy classifier training device and the microblog data processing device according to any one of the above, wherein the maximum entropy classifier training device comprises:

An acquiring unit, configured to obtain a plurality of different positive class samples and a plurality of different negative class samples, wherein each positive class sample comprises two pieces of positive account information, which are the account information of the same user on two different microblog sites; each negative class sample comprises two pieces of negative account information, which belong to different users and whose corresponding accounts belong to different microblog sites; the two microblog sites corresponding to the positive class samples are the same as the two microblog sites corresponding to the negative class samples; the two microblog sites corresponding to the sample under test are the same as the two microblog sites corresponding to the positive class samples; the positive account information comprises at least a user ID, nickname, gender, age, location, and the ratio of followers to followed users; and the negative account information comprises at least a user ID, nickname, gender, age, location, and the ratio of followers to followed users;

A second feature extraction unit, configured to perform feature extraction on each positive class sample and each negative class sample respectively to obtain the corresponding positive training samples and negative training samples;

A sixth determining unit, configured to determine the numerical values contained in each positive training sample and each negative training sample as feature values;

A third computing unit, configured to calculate, according to the formula

P_{\lambda}(y|x) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}

for each feature value, the corresponding positive prediction conditional probability and negative prediction conditional probability when each y is +1 and -1 respectively, wherein y is any positive training sample or any negative training sample, x is a feature value, P_λ(y|x) is the prediction conditional probability, exp() is the exponential function with the natural constant e as its base, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

A fourth computing unit, configured to use the GIS (Generalized Iterative Scaling) algorithm to adjust the positive-prediction conditional probability corresponding to each feature value until the positive-prediction conditional probability of each feature value converges, and to take the λ corresponding to the converged positive-prediction conditional probability of each feature value as the positive optimal weight of the feature function values corresponding to that feature value;

A fifth computing unit, configured to use the GIS algorithm to adjust the negative-prediction conditional probability corresponding to each feature value until the negative-prediction conditional probability of each feature value converges, and to take the λ corresponding to the converged negative-prediction conditional probability of each feature value as the negative optimal weight of the feature function values corresponding to that feature value.

Preferably, when the positive account information comprises at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information comprises at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, the second feature extraction unit comprises:

A sixth judging subunit, configured to judge whether the user IDs in the two pieces of positive account information of each positive-class sample are identical; if so, the first sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A seventh judging subunit, configured to judge whether the nicknames in the two pieces of positive account information of each positive-class sample are identical; if so, the second sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

An eighth judging subunit, configured to judge whether the genders in the two pieces of positive account information of each positive-class sample are identical; if so, the third sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A second comparing subunit, configured to compare the ages in the two pieces of positive account information of each positive-class sample; if neither age is filled in, the fourth sub-value of the positive feature extraction result is represented by the value 0; if the age is filled in for only one of the two pieces of positive account information, by the value 1; if the two ages are identical, by the value 2; and if the two ages differ, by the value 3;

A ninth judging subunit, configured to judge whether the locations in the two pieces of positive account information of each positive-class sample are identical; if so, the fifth sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A tenth judging subunit, configured to judge whether the ratios of followers to followed users in the two pieces of positive account information of each positive-class sample fall within the same ratio range; if so, the sixth sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A second composing subunit, configured to combine the first, second, third, fourth, fifth and sixth sub-values of the positive feature extraction result of each positive-class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample;

An eleventh judging subunit, configured to judge whether the user IDs in the two pieces of negative account information of each negative-class sample are identical; if so, the first sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A twelfth judging subunit, configured to judge whether the nicknames in the two pieces of negative account information of each negative-class sample are identical; if so, the second sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A thirteenth judging subunit, configured to judge whether the genders in the two pieces of negative account information of each negative-class sample are identical; if so, the third sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A third comparing subunit, configured to compare the ages in the two pieces of negative account information of each negative-class sample; if neither age is filled in, the fourth sub-value of the negative feature extraction result is represented by the value 0; if the age is filled in for only one of the two pieces of negative account information, by the value 1; if the two ages are identical, by the value 2; and if the two ages differ, by the value 3;

A fourteenth judging subunit, configured to judge whether the locations in the two pieces of negative account information of each negative-class sample are identical; if so, the fifth sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A fifteenth judging subunit, configured to judge whether the ratios of followers to followed users in the two pieces of negative account information of each negative-class sample fall within the same ratio range; if so, the sixth sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A third composing subunit, configured to combine the first, second, third, fourth, fifth and sixth sub-values of the negative feature extraction result of each negative-class sample into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative-class sample.

Compared with the prior art, the beneficial effects of the present application are as follows:

In the present application, a maximum-entropy classifier is used to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class. The to-be-classified positive-prediction sub-conditional probabilities of all to-be-classified feature values are multiplied together to obtain the to-be-classified positive-prediction conditional probability, and the to-be-classified negative-prediction sub-conditional probabilities of all to-be-classified feature values are multiplied together to obtain the to-be-classified negative-prediction conditional probability. The two conditional probabilities are then compared. When the to-be-classified positive-prediction conditional probability is the larger, the class of the sample to be tested is determined to be positive; when the to-be-classified negative-prediction conditional probability is the larger, the class of the sample to be tested is determined to be negative. Prediction of the class of the sample to be tested with a maximum-entropy classifier is thereby achieved.

When the class of the sample to be tested is predicted to be positive, the two accounts corresponding to the sample to be tested are determined to belong to the same user; when the class of the sample to be tested is predicted to be negative, the two accounts are determined not to belong to the same user. Identification of the same user across different microblog sites is thereby achieved.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the microblog data processing method provided by the present application;

Fig. 2 is a flowchart of the training process of the maximum-entropy classifier provided by the present application;

Fig. 3 is a schematic diagram of the logical structure of the microblog data processing device provided by the present application;

Fig. 4 is a schematic diagram of the logical structure of the microblog data processing system provided by the present application;

Fig. 5 is a schematic diagram of the logical structure of the maximum-entropy classifier training device provided by the present application.

Detailed description of the embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments in the present application without creative effort fall within the scope of protection of the present application.

Embodiment one

This embodiment presents the microblog data processing method provided by the present application. Referring to Fig. 1, which shows a flowchart of the microblog data processing method provided by the present application, the method may comprise the following steps:

Step S11: perform feature extraction on the sample to be tested to obtain the to-be-tested feature extraction result value.

The sample to be tested is a pair of information consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs differs from the microblog site to which the account corresponding to the second microblog account information belongs. For example, if the first microblog account information is denoted a and the second microblog account information is denoted b, the sample to be tested is (a, b), and the two accounts belong to different microblog sites: for instance, the account corresponding to a belongs to the Sina microblog site and the account corresponding to b belongs to the Tencent microblog site.

Step S12: determine that each value contained in the to-be-tested feature extraction result value is a to-be-classified feature value.

Step S13: use a maximum-entropy classifier to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class.

Step S14: multiply together the to-be-classified positive-prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified positive-prediction conditional probability, and multiply together the to-be-classified negative-prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified negative-prediction conditional probability.

Step S15: compare the to-be-classified positive-prediction conditional probability with the to-be-classified negative-prediction conditional probability.

When the comparison shows that the to-be-classified positive-prediction conditional probability is the larger, perform step S16; when it shows that the to-be-classified negative-prediction conditional probability is the larger, perform step S17.

Step S16: determine that the class of the sample to be tested is positive.

Step S17: determine that the class of the sample to be tested is negative.

Step S18: when the class of the sample to be tested is positive, determine that the two accounts corresponding to the sample to be tested belong to the same user.

Step S19: when the class of the sample to be tested is negative, determine that the two accounts corresponding to the sample to be tested do not belong to the same user.
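
Steps S14 to S19 can be sketched in Python as follows. This is only an illustrative sketch, not the application's implementation: the function names `classify_pair` and `same_user` are hypothetical, the sub-conditional probabilities are assumed to have been computed already (step S13), and the handling of a tie, which the text does not specify, is an assumption.

```python
from math import prod

def classify_pair(pos_sub_probs, neg_sub_probs):
    """Return +1 (positive class) or -1 (negative class) for one sample.

    pos_sub_probs / neg_sub_probs: the per-feature-value sub-conditional
    probabilities obtained in step S13 (assumed to be given here).
    """
    p_pos = prod(pos_sub_probs)  # step S14, positive side
    p_neg = prod(neg_sub_probs)  # step S14, negative side
    # Steps S15-S17. Assumption: a tie is treated as negative, because the
    # text only covers the two strict cases.
    return 1 if p_pos > p_neg else -1

def same_user(pos_sub_probs, neg_sub_probs):
    """Steps S18-S19: a positive class means the two accounts share a user."""
    return classify_pair(pos_sub_probs, neg_sub_probs) == 1
```

For six feature values, each list would hold six probabilities, one per to-be-classified feature value.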

Embodiment two

This embodiment describes the detailed process of using the maximum-entropy classifier to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class.

Specifically, the to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability of each to-be-classified feature value are calculated, for y equal to +1 and -1 respectively, according to the formula

P_{\lambda}(y \mid x) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(x, y)\right)},

where f_i() is the binary feature function

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases},

λ_i is the positive optimal weight of the feature function value f_i(x, y) when y is +1, or the negative optimal weight when y is -1, and the weights of different feature function values corresponding to the same x are identical; \sum_{i} sums over the feature function values corresponding to each to-be-classified feature value; and \sum_{y} sums over the data corresponding to the different values of y.

The preset feature values are the values contained in the training samples used in training the maximum-entropy classifier.

The process of calculating, according to the formula, the to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability of each to-be-classified feature value when y is +1 and -1 respectively is now illustrated with an example.

For example, suppose the to-be-classified feature values are 0, 1, 1, 3, 1, 1 and the preset feature values are 0, 1, 2. Because the value 3 is not among the preset feature values, λ is 0 when calculating the to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability of the value 3.

When calculating the to-be-classified positive-prediction sub-conditional probability of the value 0 among the to-be-classified feature values, λ is the positive optimal weight corresponding to 0; when calculating its to-be-classified negative-prediction sub-conditional probability, λ is the negative optimal weight corresponding to 0.

When calculating the to-be-classified positive-prediction sub-conditional probability of the value 1 among the to-be-classified feature values, λ is the positive optimal weight corresponding to 1; when calculating its to-be-classified negative-prediction sub-conditional probability, λ is the negative optimal weight corresponding to 1.

Taking the value 0 among the to-be-classified feature values as an example, the calculation of the to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability when y is +1 and -1 respectively is described below. Let the positive optimal weight corresponding to the value 0 be λ'_1 and the negative optimal weight be λ'_2. In the expressions that follow, the value of y is written as the first argument of the feature function and the feature value as the second. When y is +1, the feature function values corresponding to the value 0 when the preset class of the sample to be tested is positive and negative are f_1(1, 0) and f_{-1}(1, 0) respectively; when y is -1, they are f_1(-1, 0) and f_{-1}(-1, 0) respectively.

When y is +1, the formula gives

P_{\lambda}(1 \mid 0) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(1, 0)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(y, 0)\right)} = \frac{\exp\left(\lambda'_{1} f_{1}(1, 0) + \lambda'_{1} f_{-1}(1, 0)\right)}{\exp\left(\lambda'_{1} f_{1}(1, 0) + \lambda'_{1} f_{-1}(1, 0)\right) + \exp\left(\lambda'_{2} f_{1}(-1, 0) + \lambda'_{2} f_{-1}(-1, 0)\right)}.

The right-hand side is the to-be-classified positive-prediction sub-conditional probability of the value 0 among the to-be-classified feature values.

When y is -1, the formula gives

P_{\lambda}(-1 \mid 0) = \frac{\exp\left(\sum_{i} \lambda_{i} f_{i}(-1, 0)\right)}{\sum_{y} \exp\left(\sum_{i} \lambda_{i} f_{i}(y, 0)\right)} = \frac{\exp\left(\lambda'_{2} f_{1}(-1, 0) + \lambda'_{2} f_{-1}(-1, 0)\right)}{\exp\left(\lambda'_{1} f_{1}(1, 0) + \lambda'_{1} f_{-1}(1, 0)\right) + \exp\left(\lambda'_{2} f_{1}(-1, 0) + \lambda'_{2} f_{-1}(-1, 0)\right)}.

The right-hand side is the to-be-classified negative-prediction sub-conditional probability of the value 0 among the to-be-classified feature values.

The to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability of every other to-be-classified feature value are calculated in the same way as described above for the value 0, and the calculation is not repeated here.
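
The calculation for a single to-be-classified feature value can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the application's code: the names `weights_for` and `sub_conditional_probs` are hypothetical, the feature function values are passed in as plain lists rather than derived from training data, and the optimal weights are assumed to be stored as a mapping from a preset feature value to a pair (positive weight, negative weight).

```python
import math

def weights_for(value, optimal_weights):
    """Look up (positive weight, negative weight) for a feature value.

    A value that is not a preset feature value gets lambda = 0, as in the
    text's example for the value 3.
    """
    return optimal_weights.get(value, (0.0, 0.0))

def sub_conditional_probs(lam_pos, lam_neg, f_pos, f_neg):
    """Return (positive, negative) sub-conditional probabilities.

    f_pos: feature function values for y = +1, e.g. [f_1(1, x), f_-1(1, x)]
    f_neg: feature function values for y = -1, e.g. [f_1(-1, x), f_-1(-1, x)]
    The same weight multiplies every feature function value of one class,
    mirroring the worked expression for the value 0.
    """
    score_pos = math.exp(lam_pos * sum(f_pos))
    score_neg = math.exp(lam_neg * sum(f_neg))
    z = score_pos + score_neg  # sum over y in the denominator
    return score_pos / z, score_neg / z
```

With both weights equal to 0, as for an unseen value, the two probabilities come out equal, so such a value does not favor either class.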

Embodiment three

In Embodiment one and Embodiment two, the first microblog account information and the second microblog account information may each comprise at least: user ID (IDentity, an identification number), nickname, gender, age, location, and the ratio of followers to followed users. That is, the first microblog account information may comprise at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users; and the second microblog account information may comprise at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users.

In this embodiment, when the first microblog account information and the second microblog account information each comprise at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, the process of performing feature extraction on the sample to be tested to obtain the to-be-tested feature extraction result value may specifically be as follows:

A11: judge whether the user ID in the first microblog account information is identical to the user ID in the second microblog account information; if so, represent the first sub-value of the to-be-tested feature extraction result by the value 1, and if not, by the value 0.

A12: judge whether the nickname in the first microblog account information is identical to the nickname in the second microblog account information; if so, represent the second sub-value of the to-be-tested feature extraction result by the value 1, and if not, by the value 0.

A13: judge whether the gender in the first microblog account information is identical to the gender in the second microblog account information; if so, represent the third sub-value of the to-be-tested feature extraction result by the value 1, and if not, by the value 0.

A14: compare the age in the first microblog account information with the age in the second microblog account information; if neither age is filled in, represent the fourth sub-value of the to-be-tested feature extraction result by the value 0; if the age is filled in for only one of the two pieces of account information, by the value 1; if the two ages are identical, by the value 2; and if the two ages differ, by the value 3.

A15: judge whether the location in the first microblog account information is identical to the location in the second microblog account information; if so, represent the fifth sub-value of the to-be-tested feature extraction result by the value 1, and if not, by the value 0.

A16: judge whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, represent the sixth sub-value of the to-be-tested feature extraction result by the value 1, and if not, by the value 0.

In this embodiment, the preset ranges for the ratio of followers to followed users may be divided into: [0, 0.8], (0.8, 1.5), [1.5, 3], and greater than 3.

A17: combine the first, second, third, fourth, fifth and sixth sub-values of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.
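
Steps A11 to A17 can be sketched in Python as follows. This is a minimal sketch under assumptions that are not in the application: account information is modelled as a dictionary with the hypothetical keys `user_id`, `nickname`, `gender`, `age`, `location` and `ratio` (the ratio of followers to followed users, assumed non-negative), and an age that was not filled in is represented by `None`.

```python
def ratio_bucket(ratio):
    """Map a ratio to one of the preset ranges:
    [0, 0.8] -> 0, (0.8, 1.5) -> 1, [1.5, 3] -> 2, greater than 3 -> 3."""
    if ratio <= 0.8:
        return 0
    if ratio < 1.5:
        return 1
    if ratio <= 3:
        return 2
    return 3

def age_feature(age_a, age_b):
    """Step A14: 0 = neither filled in, 1 = only one filled in,
    2 = both filled in and identical, 3 = both filled in and different."""
    if age_a is None and age_b is None:
        return 0
    if age_a is None or age_b is None:
        return 1
    return 2 if age_a == age_b else 3

def extract_features(a, b):
    """Steps A11-A17: build the six-value feature extraction result."""
    return [
        int(a["user_id"] == b["user_id"]),                         # A11
        int(a["nickname"] == b["nickname"]),                       # A12
        int(a["gender"] == b["gender"]),                           # A13
        age_feature(a["age"], b["age"]),                           # A14
        int(a["location"] == b["location"]),                       # A15
        int(ratio_bucket(a["ratio"]) == ratio_bucket(b["ratio"])), # A16
    ]
```

The same function could be applied to positive-class and negative-class samples during training, since the text describes the same six comparisons for them.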

Embodiment four

This embodiment describes the training process of the maximum-entropy classifier. Referring to Fig. 2, which shows a flowchart of the training process of the maximum-entropy classifier provided by the present application, the process may comprise the following steps:

Step S21: obtain a plurality of different positive-class samples and a plurality of different negative-class samples.

Each positive-class sample comprises two pieces of positive account information, which are the account information of the same user on two different microblog sites. Each negative-class sample comprises two pieces of negative account information, which belong to different users and whose corresponding accounts belong to different microblog sites. The two microblog sites corresponding to the positive-class samples are the same as the two microblog sites corresponding to the negative-class samples, and the two microblog sites corresponding to the sample to be tested are the same as the two microblog sites corresponding to the positive-class samples.

In other words, the two pieces of negative account information belong to different users, and the accounts to which the two pieces of negative account information respectively correspond belong to different microblog sites.

In this embodiment, the positive-class samples and negative-class samples may be generated as described in steps B11 and B12 below:

Step B11: collect the account information of each of a plurality of sampled users on two different microblog sites.

Every sampled user has an account on each of the two different microblog sites. For example, sampled user U1 has a Sina account A on the Sina microblog site and a Tencent account B on the Tencent microblog site.

Taking sampled user U1 as an example, collecting the account information of a sampled user on the two different microblog sites works as follows: if the account information of U1's Sina account A is a and the account information of U1's Tencent account B is b, then the account information a and the account information b of sampled user U1 are collected.

Because the process of collecting the account information on the two different microblog sites is the same for every sampled user, this embodiment describes the collection process for only one sampled user, as follows: collect the sampled user's account information on the first microblog site, and collect the sampled user's account information on the second microblog site, where the first microblog site and the second microblog site are different microblog sites.

The process of collecting the sampled user's account information on the first microblog site is:

C11: build a first user queue.

C12: add the sampled user to the first user queue.

C13: take the sampled user out of the first user queue, extract the sampled user's account information on the first microblog site through the API (Application Programming Interface) provided by the first microblog site, and add that account information to the first user queue.

When the sampled user's account information on the first microblog site is needed later, it can be extracted from the first user queue.

The process of collecting the sampled user's account information on the second microblog site is:

D11: build a second user queue.

D12: add the sampled user to the second user queue.

D13: take the sampled user out of the second user queue, extract the sampled user's account information on the second microblog site through the API provided by the second microblog site, and add that account information to the second user queue.

When the sampled user's account information on the second microblog site is needed later, it can be extracted from the second user queue.
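
The queue-based collection in steps C11 to C13 (and, symmetrically, D11 to D13) can be sketched in Python as follows. This is only an illustration: `fetch_account_info` is a hypothetical callable standing in for whatever API the microblog site provides, and the function name `collect_account_info` is likewise an assumption rather than anything named in the application.

```python
from collections import deque

def collect_account_info(sampled_user, fetch_account_info):
    """Collect one sampled user's account information on one microblog site.

    fetch_account_info: hypothetical stand-in for the site's API; it takes a
    user identifier and returns that user's account information.
    """
    user_queue = deque()                       # C11 / D11: build the queue
    user_queue.append(sampled_user)            # C12 / D12: add the user
    user = user_queue.popleft()                # C13 / D13: take the user out
    account_info = fetch_account_info(user)    # query the site's API
    user_queue.append(account_info)            # store the result in the queue
    return user_queue
```

Calling the function once per site, with a different `fetch_account_info` each time, yields the first and second user queues from which the account information can later be read.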

Step B12: pair the account information of each sampled user on the two different microblog sites, and take each pair as a positive-class sample; from the account information of any two sampled users, form a group from two pieces of account information that do not belong to the same sampled user and are on different microblog sites, and take the group as a negative-class sample.

Pairing the account information of each sampled user on the two different microblog sites as a positive-class sample is a manual annotation process.

Forming, from the account information of any two sampled users, a group of two pieces of account information that do not belong to the same sampled user and are on different microblog sites as a negative-class sample is likewise a manual annotation process.

For example, suppose the account information of sampled user U1 on the two different microblog sites is a and b, and the account information of sampled user U2 on the two different microblog sites is c and d, where the accounts corresponding to a and c belong to the same microblog site, the accounts corresponding to b and d belong to the same microblog site, and the site of a and c differs from the site of b and d. Then (a, b) and (c, d) are positive-class samples, and (a, d) and (b, c) are negative-class samples.
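
The pairing in step B12 can be sketched in Python as follows. This is a sketch under assumptions: the mapping `accounts` from a sampled user to a tuple (information on the first site, information on the second site) and the function name `build_samples` are hypothetical, and every negative pair is written in the order (first site, second site), so the text's (b, c) appears here as (c, b).

```python
from itertools import permutations

def build_samples(accounts):
    """Build positive-class and negative-class samples.

    accounts: dict mapping each sampled user to a tuple
              (info on the first site, info on the second site).
    """
    # Positive: both pieces of account information belong to the same user.
    positives = [(first, second) for first, second in accounts.values()]
    # Negative: first-site information of one user paired with the
    # second-site information of a different user.
    negatives = [
        (accounts[u][0], accounts[v][1])
        for u, v in permutations(accounts, 2)
    ]
    return positives, negatives
```

For the two users in the example, `build_samples({"U1": ("a", "b"), "U2": ("c", "d")})` yields the positive samples (a, b) and (c, d) and the negative samples (a, d) and (c, b).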

Step S22: perform feature extraction on each positive-class sample and each negative-class sample respectively, to obtain the corresponding positive training samples and negative training samples.

Step S23: determine that the values contained in each positive training sample and each negative training sample are feature values.

In this embodiment, the values contained in each positive training sample and each negative training sample are the preset feature values referred to in Embodiment two.

Step S24: according to formula calculate each eigenwert respectively when each y is respectively+1 and-1 separately, corresponding positive predicted condition probability and negative predicted condition probability.

f_{i} () = \{\begin{matrix} 1, if x &Element; y \\ 0, others \end{matrix},

λ _ifor fundamental function value f _ithe weights of (x, y) and the weights of different characteristic functional value corresponding to identical x are identical, for the function of suing for peace to each eigenwert characteristic of correspondence functional value, for the function that data corresponding when being different value to y are sued for peace, described λ _iinitial value known.

Because the initial value of λ_i is known and y is a given value, the formula can be used to calculate, for each feature value, the positive prediction conditional probability and the negative prediction conditional probability corresponding to y = +1 and y = -1 respectively.
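For reference, the formula applied in step S24 has the form of the standard maximum entropy conditional model. The expression below is a reconstruction under that assumption, with the argument order f_i(y, x) and the summation indices taken from the worked instances that follow. It is a sketch, not a verbatim copy of the original formula.

```latex
P_{\lambda}(y \mid x) =
  \frac{\exp\left(\sum_{i=1} \lambda_{i} f_{i}(y, x)\right)}
       {\sum_{y'} \exp\left(\sum_{i=1} \lambda_{i} f_{i}(y', x)\right)},
  \qquad y, y' \in \{+1, -1\}
```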

An example is now given to describe in detail how the formula is used to calculate, for each feature value, the positive prediction conditional probability and negative prediction conditional probability when y is +1 and -1 respectively. Suppose there are two training samples, numbered 1 and 2. Training sample 1 is a positive training sample containing the values 0, 1, 1, 2, 1, 1, where the value 0 corresponds to the user ID. Training sample 2 is a negative training sample containing the values 0, 0, 0, 1, 0, 0.

Taking the first value 0 (the value corresponding to the user ID) as an example, the calculation of the positive prediction conditional probability and negative prediction conditional probability for y = +1 and y = -1 is described below.

The value 0 (the value corresponding to the user ID) appears in both the positive training sample and the negative training sample. Therefore, when y is +1, the value 0 corresponds to two feature function values, f₁(1,0) and f₂(1,0); when y is -1, it corresponds to two feature function values, f₁(-1,0) and f₂(-1,0). Because the weights of the different feature function values corresponding to the same x are identical, f₁(1,0) and f₂(1,0) share the same weight, denoted λ₁, and f₁(-1,0) and f₂(-1,0) share the same weight, denoted λ₂.

When y is +1, the formula gives

P_{\lambda}(1 \mid 0) = \frac{\exp\left(\sum_{i=1} \lambda_{i} f_{i}(1,0)\right)}{\sum_{y} \exp\left(\sum_{i=1} \lambda_{i} f_{i}(y,0)\right)} = \frac{\exp(\lambda_{1} f_{1}(1,0) + \lambda_{1} f_{2}(1,0))}{\exp(\lambda_{1} f_{1}(1,0) + \lambda_{1} f_{2}(1,0)) + \exp(\lambda_{2} f_{1}(-1,0) + \lambda_{2} f_{2}(-1,0))},

which is the positive prediction conditional probability of 0 (the value corresponding to the user ID).

When y is -1, the formula gives

P_{\lambda}(-1 \mid 0) = \frac{\exp\left(\sum_{i=1} \lambda_{i} f_{i}(-1,0)\right)}{\sum_{y} \exp\left(\sum_{i=1} \lambda_{i} f_{i}(y,0)\right)} = \frac{\exp(\lambda_{2} f_{1}(-1,0) + \lambda_{2} f_{2}(-1,0))}{\exp(\lambda_{1} f_{1}(1,0) + \lambda_{1} f_{2}(1,0)) + \exp(\lambda_{2} f_{1}(-1,0) + \lambda_{2} f_{2}(-1,0))},

which is the negative prediction conditional probability of 0 (the value corresponding to the user ID).

Because the initial value of λ_i is known, the values of λ₁ and λ₂ are known, so P_λ(1|0) and P_λ(-1|0) can be calculated.
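The two expressions above can be evaluated numerically as follows. This is a minimal sketch: the weight values 0.5 and 0.2 are hypothetical (the text only states that the initial weights are known), the function name `conditional_probs` is an assumption, and all four feature function values are set to 1 on the reading that the value 0 occurs in both training samples.

```python
import math

def conditional_probs(lam_pos, lam_neg, f_pos, f_neg):
    """Return (P(+1 | x), P(-1 | x)) for a single feature value x.

    lam_pos, lam_neg: shared weights (lambda_1 and lambda_2 in the text).
    f_pos, f_neg: binary feature function values for y = +1 and y = -1.
    """
    score_pos = math.exp(sum(lam_pos * f for f in f_pos))
    score_neg = math.exp(sum(lam_neg * f for f in f_neg))
    total = score_pos + score_neg  # denominator: sum over both values of y
    return score_pos / total, score_neg / total

# Hypothetical initial weights, chosen only for illustration.
lambda_1, lambda_2 = 0.5, 0.2
# Assumed: f1(1,0) = f2(1,0) = f1(-1,0) = f2(-1,0) = 1.
p_pos, p_neg = conditional_probs(lambda_1, lambda_2, [1, 1], [1, 1])
```

Because both probabilities share one denominator, they always sum to 1, so only the relative size of the two weights decides which is larger.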

For the values 1, 1, 2, 1, 1 contained in the positive training sample, the positive prediction conditional probability and negative prediction conditional probability of each are calculated in the same way as for the value 0 (the value corresponding to the user ID) above, and the calculation is not repeated here.

Likewise, the positive prediction conditional probability and negative prediction conditional probability of every other feature value are calculated in the same way as for the value 0 (the value corresponding to the user ID) above, and the calculation is not repeated here.

Step S25: using the GIS (Generalized Iterative Scaling) algorithm, adjust the positive prediction conditional probability of each feature value until it converges, and take the λ corresponding to the converged positive prediction conditional probability of each feature value as the optimal positive weight of the feature function values corresponding to that feature value.

The principle by which the GIS algorithm adjusts the positive prediction conditional probability of each feature value until convergence is an existing one and is not repeated here.

In this embodiment, the positive prediction conditional probability of a feature value has converged when it reaches its maximum value.

Step S26: using the GIS algorithm, adjust the negative prediction conditional probability of each feature value until it converges, and take the λ corresponding to the converged negative prediction conditional probability of each feature value as the optimal negative weight of the feature function values corresponding to that feature value.

The principle by which the GIS algorithm adjusts the negative prediction conditional probability of each feature value until convergence is an existing one and is not repeated here.

In this embodiment, the negative prediction conditional probability of a feature value has converged when it reaches its maximum value.
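Since the text treats GIS as existing art, the following is only a generic sketch of Generalized Iterative Scaling for a two-class conditional maximum entropy model, not the exact procedure of steps S25 and S26. The feature design (one binary feature per position, value and label), the fixed iteration count, and all function names are assumptions. Features never observed in training are simply left out.

```python
import math
from collections import defaultdict

LABELS = (+1, -1)

def active_features(x, y):
    """Binary features (position, value, label) that fire for the pair (x, y)."""
    return [(pos, value, y) for pos, value in enumerate(x)]

def predict_proba(weights, x):
    """Conditional maximum entropy probabilities P(y | x) for both labels."""
    scores = {y: sum(weights.get(f, 0.0) for f in active_features(x, y)) for y in LABELS}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

def train_gis(samples, iterations=50):
    """Generalized Iterative Scaling over (feature_vector, label) samples."""
    c = len(samples[0][0])  # features firing per (x, y): the GIS constant
    empirical = defaultdict(float)
    for x, y in samples:
        for f in active_features(x, y):
            empirical[f] += 1.0  # observed feature counts
    weights = {f: 0.0 for f in empirical}
    for _ in range(iterations):
        expected = defaultdict(float)
        for x, _label in samples:
            probs = predict_proba(weights, x)
            for y in LABELS:
                for f in active_features(x, y):
                    if f in weights:
                        expected[f] += probs[y]  # model-expected counts
        for f in weights:
            # GIS update: move each weight toward matching the observed count.
            weights[f] += math.log(empirical[f] / expected[f]) / c
    return weights

# The two training samples from the worked example in step S24.
samples = [((0, 1, 1, 2, 1, 1), +1), ((0, 0, 0, 1, 0, 0), -1)]
weights = train_gis(samples)
```

Calling `predict_proba(weights, (0, 1, 1, 2, 1, 1))` then returns the model's probabilities for the positive training vector. A real implementation would stop on a convergence test rather than after a fixed number of iterations.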

The maximum entropy classifier obtained after training in steps S21 to S26 can be used to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class respectively; the detailed process is described in Embodiment two.

In this embodiment, the positive account information may include at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users; the negative account information may include at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users.

When the positive account information includes at least the user ID, nickname, gender, age, location, and ratio of followers to followed users, and the negative account information includes at least the same fields, the process of performing feature extraction on each positive-class sample and each negative-class sample to obtain the corresponding positive training samples and negative training samples is specifically as follows:

E11: determine whether the user IDs in the two pieces of positive account information of each positive-class sample are identical; if so, represent the first sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E12: determine whether the nicknames in the two pieces of positive account information of each positive-class sample are identical; if so, represent the second sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E13: determine whether the genders in the two pieces of positive account information of each positive-class sample are identical; if so, represent the third sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E14: compare the ages in the two pieces of positive account information of each positive-class sample; if neither age is filled in, represent the fourth sub-value of the positive feature extraction result with the value 0; if only one of the two ages is filled in, represent it with the value 1; if the two ages are identical, represent it with the value 2; if the two ages differ, represent it with the value 3.

E15: determine whether the locations in the two pieces of positive account information of each positive-class sample are identical; if so, represent the fifth sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E16: determine whether the ratios of followers to followed users in the two pieces of positive account information of each positive-class sample fall within the same ratio range; if so, represent the sixth sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E17: combine the first, second, third, fourth, fifth, and sixth sub-values of the positive feature extraction result of each positive-class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample.

E18: determine whether the user IDs in the two pieces of negative account information of each negative-class sample are identical; if so, represent the first sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E19: determine whether the nicknames in the two pieces of negative account information of each negative-class sample are identical; if so, represent the second sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E110: determine whether the genders in the two pieces of negative account information of each negative-class sample are identical; if so, represent the third sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E111: compare the ages in the two pieces of negative account information of each negative-class sample; if neither age is filled in, represent the fourth sub-value of the negative feature extraction result with the value 0; if only one of the two ages is filled in, represent it with the value 1; if the two ages are identical, represent it with the value 2; if the two ages differ, represent it with the value 3.

E112: determine whether the locations in the two pieces of negative account information of each negative-class sample are identical; if so, represent the fifth sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E113: determine whether the ratios of followers to followed users in the two pieces of negative account information of each negative-class sample fall within the same ratio range; if so, represent the sixth sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E114: combine the first, second, third, fourth, fifth, and sixth sub-values of the negative feature extraction result of each negative-class sample into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative-class sample.

In this embodiment, steps E11 to E17 are illustrated with an example. The positive account information a and b of user U1 on two different microblog sites forms the positive-class sample (a, b); with reference to Table 1, the following describes how feature extraction is performed on the positive-class sample (a, b) to obtain a positive training sample.

Table 1

As shown in Table 1, the first sub-value of the positive feature extraction result is 0, the second is 1, the third is 1, the fourth is 2, the fifth is 1, and the sixth is 1, so the positive feature extraction result value is a single row of values, namely {0, 1, 1, 2, 1, 1}.

In the above embodiments, the microblog site to which the account corresponding to the first microblog account information belongs may be, but is not limited to, Sina Weibo, and the microblog site to which the account corresponding to the second microblog account information belongs may be, but is not limited to, Tencent Weibo.
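Steps E11 to E17 can be sketched as a single function. The field names, the account values, and the ratio range boundaries in `RATIO_EDGES` are all hypothetical; the account values were chosen only so that the result reproduces the vector {0, 1, 1, 2, 1, 1} given above, and they are not the contents of Table 1.

```python
from bisect import bisect_right

# Hypothetical boundaries of the follower/followed ratio ranges.
RATIO_EDGES = (0.5, 1.0, 2.0, 5.0)

def ratio_bucket(ratio):
    """Index of the ratio range that `ratio` falls into."""
    return bisect_right(RATIO_EDGES, ratio)

def age_code(age_a, age_b):
    """0: neither filled in, 1: only one filled in, 2: identical, 3: different."""
    if age_a is None and age_b is None:
        return 0
    if age_a is None or age_b is None:
        return 1
    return 2 if age_a == age_b else 3

def extract_features(a, b):
    """Six sub-values for a pair of account records, in the order E11 to E16."""
    return [
        int(a["user_id"] == b["user_id"]),
        int(a["nickname"] == b["nickname"]),
        int(a["gender"] == b["gender"]),
        age_code(a.get("age"), b.get("age")),
        int(a["location"] == b["location"]),
        int(ratio_bucket(a["follower_ratio"]) == ratio_bucket(b["follower_ratio"])),
    ]

# Hypothetical account records for one user on two sites.
a = {"user_id": "u1001", "nickname": "xiaoming", "gender": "M",
     "age": 25, "location": "Suzhou", "follower_ratio": 1.2}
b = {"user_id": "ming_88", "nickname": "xiaoming", "gender": "M",
     "age": 25, "location": "Suzhou", "follower_ratio": 1.6}
training_sample = extract_features(a, b)
```

The same function applies unchanged to negative-class samples (steps E18 to E114) and to a sample under test, since only the label attached to the resulting vector differs.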

Embodiment five

Corresponding to the above method embodiments, this embodiment provides a microblog data processing device. Referring to Fig. 3, which shows a schematic diagram of the logical structure of a microblog data processing device provided by the present application, the microblog data processing device comprises: a first feature extraction unit 31, a first determining unit 32, a first computing unit 33, a second computing unit 34, a comparing unit 35, a second determining unit 36, a third determining unit 37, a fourth determining unit 38, and a fifth determining unit 39.

The first feature extraction unit 31 is configured to perform feature extraction on a sample under test to obtain a to-be-tested feature extraction result value, where the sample under test is a pair consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs differs from the microblog site to which the account corresponding to the second microblog account information belongs.

The first determining unit 32 is configured to determine each value contained in the to-be-tested feature extraction result value as a to-be-classified feature value.

The first computing unit 33 is configured to use the maximum entropy classifier to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class respectively.

The second computing unit 34 is configured to multiply together the to-be-classified positive prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified positive prediction conditional probability, and to multiply together the to-be-classified negative prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified negative prediction conditional probability.

The comparing unit 35 is configured to compare the to-be-classified positive prediction conditional probability with the to-be-classified negative prediction conditional probability. When the to-be-classified positive prediction conditional probability is the larger, it triggers the second determining unit 36 to determine that the class of the sample under test is positive; when the to-be-classified negative prediction conditional probability is the larger, it triggers the third determining unit 37 to determine that the class of the sample under test is negative.

The fourth determining unit 38 is configured to determine, when the class of the sample under test is positive, that the two accounts corresponding to the sample under test belong to the same user.

The fifth determining unit 39 is configured to determine, when the class of the sample under test is negative, that the two accounts corresponding to the sample under test do not belong to the same user.
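The work of the second computing unit, the comparing unit, and the determining units can be sketched as follows. The sub-probabilities are hypothetical placeholders for the values a trained classifier would supply, the function names are assumptions, and treating a tie as negative is also an assumption, since the text does not cover that case.

```python
import math

def classify(pos_sub_probs, neg_sub_probs):
    """Return +1 or -1 by comparing the products of the sub-probabilities."""
    p_pos = math.prod(pos_sub_probs)  # to-be-classified positive probability
    p_neg = math.prod(neg_sub_probs)  # to-be-classified negative probability
    # A tie is treated as negative here; the text does not specify this case.
    return +1 if p_pos > p_neg else -1

def same_user(pos_sub_probs, neg_sub_probs):
    """True when the two accounts are judged to belong to the same user."""
    return classify(pos_sub_probs, neg_sub_probs) == +1

# Hypothetical sub-probabilities for six to-be-classified feature values.
pos = [0.5, 0.7, 0.6, 0.8, 0.7, 0.6]
neg = [0.5, 0.3, 0.4, 0.2, 0.3, 0.4]
label = classify(pos, neg)
```

With many feature values, summing logarithms of the sub-probabilities instead of multiplying them gives the same comparison while avoiding floating-point underflow.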

In this embodiment, the first computing unit 33 specifically comprises a computation sub-unit configured to use the maximum entropy objective function formula to calculate, for each to-be-classified feature value, the corresponding to-be-classified positive prediction sub-conditional probability and to-be-classified negative prediction sub-conditional probability when y is +1 and -1 respectively, where y is the sample under test, x is the to-be-classified feature value, P_λ(y|x) is the to-be-classified prediction sub-conditional probability, exp() is the exponential function with base e, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

In the above device, when the first microblog account information and the second microblog account information each include at least a user identity number (ID), nickname, gender, age, location, and ratio of followers to followed users, the first feature extraction unit 31 specifically comprises:

A first judgment sub-unit, configured to determine whether the user ID in the first microblog account information is identical to the user ID in the second microblog account information; if so, the first sub-value of the to-be-tested feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A second judgment sub-unit, configured to determine whether the nickname in the first microblog account information is identical to the nickname in the second microblog account information; if so, the second sub-value of the to-be-tested feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A third judgment sub-unit, configured to determine whether the gender in the first microblog account information is identical to the gender in the second microblog account information; if so, the third sub-value of the to-be-tested feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A first comparison sub-unit, configured to compare the age in the first microblog account information with the age in the second microblog account information; if neither age is filled in, the fourth sub-value of the to-be-tested feature extraction result is represented with the value 0; if only one of the two ages is filled in, it is represented with the value 1; if the two ages are identical, it is represented with the value 2; if the two ages differ, it is represented with the value 3.

A fourth judgment sub-unit, configured to determine whether the location in the first microblog account information is identical to the location in the second microblog account information; if so, the fifth sub-value of the to-be-tested feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A fifth judgment sub-unit, configured to determine whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, the sixth sub-value of the to-be-tested feature extraction result is represented with the value 1; if not, it is represented with the value 0.

Embodiment six

This embodiment presents a microblog data processing system. Referring to Fig. 4, which shows a schematic diagram of the logical structure of a microblog data processing system provided by the present application, the microblog data processing system comprises: a maximum entropy classifier training device 41 and a microblog data processing device 42.

The specific structure of the microblog data processing device 42 is the same as that of the microblog data processing device described in Embodiment five and is not repeated here.

In this embodiment, the specific structure of the maximum entropy classifier training device 41 is shown in Fig. 5, which is a schematic diagram of the logical structure of a maximum entropy classifier training device provided by the present application. The maximum entropy classifier training device comprises: an acquiring unit 51, a second feature extraction unit 52, a sixth determining unit 53, a third computing unit 54, a fourth computing unit 55, and a fifth computing unit 56.

The acquiring unit 51 is configured to obtain multiple different positive-class samples and multiple different negative-class samples.

Each positive-class sample comprises two pieces of positive account information, which are the account information of the same user on two different microblog sites. Each negative-class sample comprises two pieces of negative account information, which belong to different users and whose corresponding accounts belong to different microblog sites. The two microblog sites corresponding to the positive-class samples are the same as the two microblog sites corresponding to the negative-class samples, and the two microblog sites corresponding to the sample under test are the same as the two microblog sites corresponding to the positive-class samples. The positive account information includes at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users; the negative account information includes at least: user ID, nickname, gender, age, location, and the ratio of followers to followed users.

The second feature extraction unit 52 is configured to perform feature extraction on each positive-class sample and each negative-class sample to obtain the corresponding positive training samples and negative training samples.

The sixth determining unit 53 is configured to determine the values contained in each positive training sample and each negative training sample as feature values.

The third computing unit 54 is configured to calculate, according to the formula, for each feature value, the corresponding positive prediction conditional probability and negative prediction conditional probability when y is +1 and -1 respectively, where y is any positive training sample or any negative training sample, x is the feature value, P_λ(y|x) is the prediction conditional probability, exp() is the exponential function with base e, and f_i() is a binary feature function defined as

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases}

The fourth computing unit 55 is configured to use the GIS algorithm to adjust the positive prediction conditional probability of each feature value until it converges, and to take the λ corresponding to the converged positive prediction conditional probability of each feature value as the optimal positive weight of the feature function values corresponding to that feature value.

The fifth computing unit 56 is configured to use the GIS algorithm to adjust the negative prediction conditional probability of each feature value until it converges, and to take the λ corresponding to the converged negative prediction conditional probability of each feature value as the optimal negative weight of the feature function values corresponding to that feature value.

In this embodiment, when the positive account information includes at least the user ID, nickname, gender, age, location, and ratio of followers to followed users, and the negative account information includes at least the same fields, the second feature extraction unit 52 specifically comprises:

A sixth judgment sub-unit, configured to determine whether the user IDs in the two pieces of positive account information of each positive-class sample are identical; if so, the first sub-value of the positive feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A seventh judgment sub-unit, configured to determine whether the nicknames in the two pieces of positive account information of each positive-class sample are identical; if so, the second sub-value of the positive feature extraction result is represented with the value 1; if not, it is represented with the value 0.

An eighth judgment sub-unit, configured to determine whether the genders in the two pieces of positive account information of each positive-class sample are identical; if so, the third sub-value of the positive feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A second comparison sub-unit, configured to compare the ages in the two pieces of positive account information of each positive-class sample; if neither age is filled in, the fourth sub-value of the positive feature extraction result is represented with the value 0; if only one of the two ages is filled in, it is represented with the value 1; if the two ages are identical, it is represented with the value 2; if the two ages differ, it is represented with the value 3.

A ninth judgment sub-unit, configured to determine whether the locations in the two pieces of positive account information of each positive-class sample are identical; if so, the fifth sub-value of the positive feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A tenth judgment sub-unit, configured to determine whether the ratios of followers to followed users in the two pieces of positive account information of each positive-class sample fall within the same ratio range; if so, the sixth sub-value of the positive feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A second composition sub-unit, configured to combine the first, second, third, fourth, fifth, and sixth sub-values of the positive feature extraction result of each positive-class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample.

An eleventh judgment sub-unit, configured to determine whether the user IDs in the two pieces of negative account information of each negative-class sample are identical; if so, the first sub-value of the negative feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A twelfth judgment sub-unit, configured to determine whether the nicknames in the two pieces of negative account information of each negative-class sample are identical; if so, the second sub-value of the negative feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A thirteenth judgment sub-unit, configured to determine whether the genders in the two pieces of negative account information of each negative-class sample are identical; if so, the third sub-value of the negative feature extraction result is represented with the value 1; if not, it is represented with the value 0.

A third comparison sub-unit, configured to compare the ages in the two pieces of negative account information of each negative-class sample; if neither age is filled in, the fourth sub-value of the negative feature extraction result is represented with the value 0; if only one of the two ages is filled in, it is represented with the value 1; if the two ages are identical, it is represented with the value 2; if the two ages differ, it is represented with the value 3.

A fourteenth judgment sub-unit, configured to determine whether the locations in the two pieces of negative account information of each negative-class sample are identical; if so, the fifth sub-value of the negative feature extraction result is represented with the value 1; if not, it is represented with the value 0.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, see the corresponding parts of the method embodiments.

Finally, it should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Absent further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

For convenience of description, the above device is described with its functions divided into separate units. Of course, when the present application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.

The microblog data processing method, device, and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A microblog data processing method, characterized by comprising:

2. The method according to claim 1, characterized in that the process of using the maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive-prediction sub-conditional probability and to-be-classified negative-prediction sub-conditional probability when the sample under test is preset as the positive class and as the negative class comprises:

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases},

3. method according to claim 1 and 2, is characterized in that, described first microblog account information and described second microblog account information at least comprise separately:

4. The method according to claim 3, characterized in that the process of performing feature extraction on the sample under test to obtain the feature extraction result value under test comprises:

5. The method according to claim 1, characterized in that the training process of the maximum entropy classifier comprises:

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases},

6. The method according to claim 5, characterized in that, when the positive account information comprises at least a user ID, a nickname, a gender, an age, a location, and the ratio of followers to followed users, and the negative account information comprises at least a user ID, a nickname, a gender, an age, a location, and the ratio of followers to followed users, the process of performing feature extraction on each positive-class sample and each negative-class sample respectively to obtain the corresponding positive training samples and negative training samples comprises:

7. A microblog data processing device, characterized by comprising:

8. The device according to claim 7, characterized in that the first calculating unit comprises:

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases},

9. The device according to claim 7 or 8, characterized in that, when the first microblog account information and the second microblog account information each comprise at least a user identity (ID), a nickname, a gender, an age, a location, and the ratio of followers to followed users, the first feature extraction unit comprises:

10. A microblog data processing system, characterized by comprising a maximum entropy classifier training device and the microblog data processing device according to any one of claims 7-9, wherein the maximum entropy classifier training device comprises:

f_{i}(x, y) = \begin{cases} 1, & \text{if } x \in y \\ 0, & \text{otherwise} \end{cases},

11. The system according to claim 10, characterized in that, when the positive account information comprises at least a user ID, a nickname, a gender, an age, a location, and the ratio of followers to followed users, and the negative account information comprises at least a user ID, a nickname, a gender, an age, a location, and the ratio of followers to followed users, the second feature extraction unit comprises: