CN104537118B

CN104537118B - Microblog data processing method, apparatus and system

Info

Publication number: CN104537118B
Application number: CN201510036778.2A
Authority: CN
Inventors: 李寿山; 王晶晶; 段湘煜; 周国栋
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2015-01-26
Filing date: 2015-01-26
Publication date: 2017-12-26
Anticipated expiration: 2035-01-26
Also published as: CN104537118A

Abstract

This application provides a microblog data processing method, apparatus and system. In the method, a maximum entropy classifier is used to calculate, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when the test sample is preset as the positive class and as the negative class. The positive predicted conditional probability and the negative predicted conditional probability are then obtained and compared. If the positive predicted conditional probability is the larger, the class of the test sample is determined to be positive; if the negative predicted conditional probability is the larger, the class of the test sample is determined to be negative. The class of the test sample is thereby predicted. When the predicted class is positive, the two accounts corresponding to the test sample are determined to belong to the same user; when the predicted class is negative, the two accounts are determined not to belong to the same user. The same user can thus be identified across different microblog sites.

Description

Microblog data processing method, apparatus and system

Technical field

This application relates to the fields of natural language processing and social networks, and in particular to a microblog data processing method, apparatus and system.

Background

In recent years, with the rapid development of social networks, microblogs (micro-blogs) have become popular with users. Sina Weibo and Tencent Weibo, for example, are well-known microblog sites in China. By the end of December 2012, Sina Weibo had more than 503 million registered users and Tencent Weibo had reached 540 million, with microblog users publishing more than 200 million posts per day. Because microblogs have the characteristics of both broadcast media and social networks, they have attracted many researchers to analyze and study microblog data.

In analyzing microblog data, identifying the same user across different microblog sites is important: it helps enterprises target advertising precisely, and it supports research into why the same user uses different social networks and the related correlation analysis, which in turn helps social network operators develop better social network products.

However, there is currently no effective method for identifying the same user across different microblog sites.

Summary of the invention

To solve the above technical problem, the embodiments of this application provide a microblog data processing method, apparatus and system for identifying the same user across different microblog sites. The technical solution is as follows:

A microblog data processing method, comprising:

Performing feature extraction on a test sample to obtain a test feature extraction result, wherein the test sample is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs differs from the microblog site to which the account corresponding to the second microblog account information belongs;

Determining each numerical value contained in the test feature extraction result as a feature value to be classified;

Using a maximum entropy classifier, calculating, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when the test sample is preset as the positive class and as the negative class;

Multiplying together the positive predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a positive predicted conditional probability, and multiplying together the negative predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a negative predicted conditional probability;

Comparing the positive predicted conditional probability with the negative predicted conditional probability;

If the comparison shows that the positive predicted conditional probability is the larger, determining that the class of the test sample is positive;

If the comparison shows that the negative predicted conditional probability is the larger, determining that the class of the test sample is negative;

When the class of the test sample is positive, determining that the two accounts corresponding to the test sample belong to the same user;

When the class of the test sample is negative, determining that the two accounts corresponding to the test sample do not belong to the same user.
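The multiplication and comparison steps above can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names `classify_pair` and `same_user` are hypothetical, the sub-conditional probabilities are assumed to have been computed already, and a tie is treated as negative, which the text does not specify.

```python
from math import prod


def classify_pair(pos_sub_probs, neg_sub_probs):
    """Multiply the per-feature sub-conditional probabilities and compare.

    pos_sub_probs / neg_sub_probs: one probability per feature value to be
    classified, for the positive and negative preset class respectively.
    Returns (label, positive_product, negative_product).
    """
    pos = prod(pos_sub_probs)  # positive predicted conditional probability
    neg = prod(neg_sub_probs)  # negative predicted conditional probability
    # Assumption: a tie is treated as negative (not specified in the text).
    label = +1 if pos > neg else -1
    return label, pos, neg


def same_user(label):
    """A positive class means the two accounts belong to the same user."""
    return label == +1


label, pos, neg = classify_pair([0.8, 0.6, 0.7], [0.2, 0.4, 0.3])
print(label, same_user(label))
```

For many features, summing logarithms instead of multiplying raw probabilities would avoid numerical underflow; with the six features used here, the direct product is adequate.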

Preferably, the process of using a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when the test sample is preset as the positive class and as the negative class comprises:

Using the maximum entropy objective function $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$ to calculate, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when y is +1 and -1 respectively, wherein y is the test sample, x is the feature value to be classified, $P_\lambda(y \mid x)$ is the predicted sub-conditional probability, $\exp(\cdot)$ is the exponential function with the natural constant e as its base, $f_i(\cdot)$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, namely the positive optimal weight when y is +1 or the negative optimal weight when y is -1, the weights of the different feature function values corresponding to the same x being identical, $\sum_i$ sums the feature function values corresponding to each feature value to be classified, and $\sum_y$ sums the corresponding data over the different values of y;

Wherein y being +1 indicates that the test sample is preset as the positive class, and y being -1 indicates that the test sample is preset as the negative class; the feature function values corresponding to each feature value to be classified correspond respectively to the positive and negative preset classes of the test sample; when calculating the positive predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the positive optimal weight corresponding to that feature value, and otherwise λ is 0; when calculating the negative predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the negative optimal weight corresponding to that feature value, and otherwise λ is 0.
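As a rough illustration of the weight lookup described above, the following Python sketch computes the positive and negative sub-conditional probabilities for a single feature value. It assumes that exactly one binary feature function fires for each (x, y) pair, so the exponent reduces to the looked-up weight; the names `sub_conditional_probs`, `pos_weights` and `neg_weights`, and the use of a (slot, value) tuple as the feature key, are hypothetical and not taken from the source.

```python
import math


def sub_conditional_probs(feature_value, pos_weights, neg_weights):
    """Return (P(+1 | x), P(-1 | x)) for one feature value x.

    pos_weights / neg_weights map a preset feature value to its positive /
    negative optimal weight. A feature value that is not among the preset
    feature values gets weight 0, as described in the text.
    """
    lam_pos = pos_weights.get(feature_value, 0.0)
    lam_neg = neg_weights.get(feature_value, 0.0)
    # Subtract the maximum before exponentiating for numerical stability;
    # this does not change the normalized result.
    m = max(lam_pos, lam_neg)
    e_pos = math.exp(lam_pos - m)
    e_neg = math.exp(lam_neg - m)
    z = e_pos + e_neg  # normalizer Z(x): sum over y in {+1, -1}
    return e_pos / z, e_neg / z


# Hypothetical weights keyed by (feature slot, feature value).
pos_w = {("nickname", 1): math.log(3.0)}
neg_w = {("nickname", 1): 0.0}
print(sub_conditional_probs(("nickname", 1), pos_w, neg_w))
print(sub_conditional_probs(("gender", 0), pos_w, neg_w))  # unseen feature
```

With both weights equal to 0, the two probabilities are equal, so a feature value absent from the preset feature values does not favor either class.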

Preferably, the first microblog account information and the second microblog account information each comprise at least:

A user identification number (ID), a nickname, a gender, an age, a location, and a ratio of followers to followed users.

Preferably, the process of performing feature extraction on the test sample to obtain the test feature extraction result comprises:

Judging whether the ID in the first microblog account information and the ID in the second microblog account information are identical; if identical, representing the first sub-value of the test feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nickname in the first microblog account information and the nickname in the second microblog account information are identical; if identical, representing the second sub-value of the test feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the gender in the first microblog account information and the gender in the second microblog account information are identical; if identical, representing the third sub-value of the test feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the age in the first microblog account information with the age in the second microblog account information; if neither age is filled in, representing the fourth sub-value of the test feature extraction result with the value 0; if the age is filled in in only one of the first microblog account information and the second microblog account information, representing the fourth sub-value with the value 1; if the two ages are identical, representing the fourth sub-value with the value 2; if the two ages differ, representing the fourth sub-value with the value 3;

Judging whether the location in the first microblog account information and the location in the second microblog account information are identical; if identical, representing the fifth sub-value of the test feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, representing the sixth sub-value of the test feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the test feature extraction result into the test feature extraction result.
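The six-step extraction above maps naturally onto a small function. The following Python sketch is one possible reading; the `Account` class, its field names, and in particular the `RATIO_BOUNDS` bucket boundaries are assumptions made for illustration, since the text does not define the ratio ranges.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Account:
    """Hypothetical container for one microblog account's information."""
    user_id: str
    nickname: str
    gender: str
    age: Optional[int]      # None when the age is not filled in
    location: str
    follower_ratio: float   # followers divided by followed users


# Assumed bucket boundaries; the source does not specify the ratio ranges.
RATIO_BOUNDS = (0.5, 1.0, 2.0, 10.0)


def ratio_bucket(ratio: float) -> int:
    """Index of the ratio range that `ratio` falls into."""
    return sum(ratio >= bound for bound in RATIO_BOUNDS)


def age_feature(a: Optional[int], b: Optional[int]) -> int:
    """0: neither filled in, 1: only one filled in, 2: same, 3: different."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1
    return 2 if a == b else 3


def extract_features(a: Account, b: Account) -> Tuple[int, ...]:
    """Build the six sub-values of the feature extraction result."""
    return (
        int(a.user_id == b.user_id),      # first sub-value: ID
        int(a.nickname == b.nickname),    # second sub-value: nickname
        int(a.gender == b.gender),        # third sub-value: gender
        age_feature(a.age, b.age),        # fourth sub-value: age
        int(a.location == b.location),    # fifth sub-value: location
        int(ratio_bucket(a.follower_ratio) == ratio_bucket(b.follower_ratio)),
    )


first = Account("u1", "cat", "F", None, "Suzhou", 1.5)
second = Account("u1", "kitty", "F", 25, "Suzhou", 1.8)
print(extract_features(first, second))
```

The same function could be applied unchanged to the labelled account pairs used for training, since the training-side extraction described later follows the same six rules.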

Preferably, the training process of the maximum entropy classifier comprises:

Obtaining a plurality of different positive class samples and a plurality of different negative class samples, wherein each positive class sample comprises two positive account information items, which are the account information of the same user on two different microblog sites; each negative class sample comprises two negative account information items, which belong to different users and whose corresponding accounts belong to different microblog sites; the two microblog sites corresponding to the positive class samples are the same as the two microblog sites corresponding to the negative class samples; and the two microblog sites corresponding to the test sample are the same as the two microblog sites corresponding to the positive class samples;

Performing feature extraction on each positive class sample and each negative class sample respectively to obtain the corresponding positive training samples and negative training samples;

Determining the numerical values contained in each positive training sample and each negative training sample as feature values;

According to the formula $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, calculating, for each feature value, the corresponding positive predicted conditional probability and negative predicted conditional probability when each y is +1 and -1 respectively;

Wherein y is any positive training sample or any negative training sample, x is a feature value, $P_\lambda(y \mid x)$ is the predicted conditional probability, $\exp(\cdot)$ is the exponential function with the natural constant e as its base, $f_i(\cdot)$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, the weights of the different feature function values corresponding to the same x being identical, $\sum_i$ sums the feature function values corresponding to each feature value, $\sum_y$ sums the corresponding data over the different values of y, and the initial value of $\lambda_i$ is known;

Using the GIS algorithm, adjusting the positive predicted conditional probability corresponding to each feature value until the positive predicted conditional probability of each feature value converges, and taking the λ corresponding to the converged positive predicted conditional probability of each feature value as the positive optimal weight of the feature function value corresponding to that feature value;

Using the GIS algorithm, adjusting the negative predicted conditional probability corresponding to each feature value until the negative predicted conditional probability of each feature value converges, and taking the λ corresponding to the converged negative predicted conditional probability of each feature value as the negative optimal weight of the feature function value corresponding to that feature value.
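To make the weight-fitting step more concrete, here is a minimal Python sketch of textbook GIS (generalized iterative scaling) for a two-class maximum entropy model over binary indicator features. It is not claimed to reproduce the exact procedure of this application: the feature key `(slot, value, label)`, the fixed iteration count used in place of a convergence test, and the omission of the usual correction feature are all assumptions of the sketch. Weights keyed with label +1 play the role of positive optimal weights, and those keyed with label -1 the role of negative optimal weights.

```python
import math
from collections import defaultdict

LABELS = (+1, -1)


def predict_proba(x, weights):
    """P(y | x) under the current weights, for y in {+1, -1}."""
    scores = {
        y: sum(weights.get((slot, v, y), 0.0) for slot, v in enumerate(x))
        for y in LABELS
    }
    m = max(scores.values())  # subtract the maximum for stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train_gis(samples, iterations=50):
    """samples: list of (feature_tuple, label) with label in {+1, -1}.

    Returns a dict mapping (slot, value, label) to a weight.
    """
    n = len(samples)
    # GIS constant: at most one indicator fires per slot for any (x, y).
    # Assumption: the correction feature is omitted, as is common practice.
    c = len(samples[0][0])

    # Empirical expectation of each indicator feature seen in training.
    empirical = defaultdict(float)
    for x, y in samples:
        for slot, v in enumerate(x):
            empirical[(slot, v, y)] += 1.0 / n

    weights = {key: 0.0 for key in empirical}  # known initial values

    for _ in range(iterations):
        # Model expectation of each feature under the current weights.
        expected = defaultdict(float)
        for x, _label in samples:
            probs = predict_proba(x, weights)
            for y in LABELS:
                for slot, v in enumerate(x):
                    key = (slot, v, y)
                    if key in weights:
                        expected[key] += probs[y] / n
        # GIS update: move each weight toward matching the empirical value.
        for key in weights:
            weights[key] += math.log(empirical[key] / expected[key]) / c
    return weights


# Tiny made-up training set of six-value feature tuples (illustrative only).
train = [
    ((1, 1, 1, 2, 1, 1), +1),
    ((0, 1, 1, 2, 1, 1), +1),
    ((0, 1, 1, 0, 1, 0), +1),
    ((0, 0, 1, 3, 0, 0), -1),
    ((0, 0, 0, 1, 0, 0), -1),
    ((0, 0, 1, 0, 1, 0), -1),
]
w = train_gis(train)
print(predict_proba((0, 1, 1, 2, 1, 1), w))
```

In practice the loop would stop once the change in weights or in log-likelihood falls below a tolerance, which corresponds to the convergence condition described above.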

Preferably, where the positive account information comprises at least an ID, a nickname, a gender, an age, a location and a ratio of followers to followed users, and the negative account information comprises at least an ID, a nickname, a gender, an age, a location and a ratio of followers to followed users, the process of performing feature extraction on each positive class sample and each negative class sample respectively to obtain the corresponding positive training samples and negative training samples comprises:

Judging whether the IDs in the two positive account information items of each positive class sample are identical; if identical, representing the first sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nicknames in the two positive account information items of each positive class sample are identical; if identical, representing the second sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the genders in the two positive account information items of each positive class sample are identical; if identical, representing the third sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the ages in the two positive account information items of each positive class sample; if neither age is filled in, representing the fourth sub-value of the positive feature extraction result with the value 0; if the age is filled in in only one of the two positive account information items, representing the fourth sub-value with the value 1; if the two ages are identical, representing the fourth sub-value with the value 2; if the two ages differ, representing the fourth sub-value with the value 3;

Judging whether the locations in the two positive account information items of each positive class sample are identical; if identical, representing the fifth sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratios of followers to followed users in the two positive account information items of each positive class sample fall within the same ratio range; if so, representing the sixth sub-value of the positive feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the positive feature extraction result corresponding to each positive class sample into a positive feature extraction result, which serves as the positive training sample corresponding to that positive class sample;

Judging whether the IDs in the two negative account information items of each negative class sample are identical; if identical, representing the first sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the nicknames in the two negative account information items of each negative class sample are identical; if identical, representing the second sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the genders in the two negative account information items of each negative class sample are identical; if identical, representing the third sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Comparing the ages in the two negative account information items of each negative class sample; if neither age is filled in, representing the fourth sub-value of the negative feature extraction result with the value 0; if the age is filled in in only one of the two negative account information items, representing the fourth sub-value with the value 1; if the two ages are identical, representing the fourth sub-value with the value 2; if the two ages differ, representing the fourth sub-value with the value 3;

Judging whether the locations in the two negative account information items of each negative class sample are identical; if identical, representing the fifth sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Judging whether the ratios of followers to followed users in the two negative account information items of each negative class sample fall within the same ratio range; if so, representing the sixth sub-value of the negative feature extraction result with the value 1; if not, representing it with the value 0;

Composing the first, second, third, fourth, fifth and sixth sub-values of the negative feature extraction result corresponding to each negative class sample into a negative feature extraction result, which serves as the negative training sample corresponding to that negative class sample.

A microblog data processing apparatus, comprising:

A first feature extraction unit, configured to perform feature extraction on a test sample to obtain a test feature extraction result, wherein the test sample is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog site to which the account corresponding to the first microblog account information belongs differs from the microblog site to which the account corresponding to the second microblog account information belongs;

A first determining unit, configured to determine each numerical value contained in the test feature extraction result as a feature value to be classified;

A first computing unit, configured to use a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when the test sample is preset as the positive class and as the negative class;

A second computing unit, configured to multiply together the positive predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a positive predicted conditional probability, and to multiply together the negative predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a negative predicted conditional probability;

A comparing unit, configured to compare the positive predicted conditional probability with the negative predicted conditional probability; if the comparison shows that the positive predicted conditional probability is the larger, to trigger a second determining unit to determine that the class of the test sample is positive; and if the comparison shows that the negative predicted conditional probability is the larger, to trigger a third determining unit to determine that the class of the test sample is negative;

A fourth determining unit, configured to determine, when the class of the test sample is positive, that the two accounts corresponding to the test sample belong to the same user;

A fifth determining unit, configured to determine, when the class of the test sample is negative, that the two accounts corresponding to the test sample do not belong to the same user.

Preferably, the first computing unit comprises:

A computation subunit, configured to use the maximum entropy objective function $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$ to calculate, for each feature value to be classified, the corresponding positive predicted sub-conditional probability and negative predicted sub-conditional probability when y is +1 and -1 respectively, wherein y is the test sample, x is the feature value to be classified, $P_\lambda(y \mid x)$ is the predicted sub-conditional probability, $\exp(\cdot)$ is the exponential function with the natural constant e as its base, $f_i(\cdot)$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, namely the positive optimal weight when y is +1 or the negative optimal weight when y is -1, the weights of the different feature function values corresponding to the same x being identical, $\sum_i$ sums the feature function values corresponding to each feature value to be classified, and $\sum_y$ sums the corresponding data over the different values of y;

Preferably, where the first microblog account information and the second microblog account information each comprise at least a user identification number (ID), a nickname, a gender, an age, a location and a ratio of followers to followed users, the first feature extraction unit comprises:

A first judging subunit, configured to judge whether the ID in the first microblog account information and the ID in the second microblog account information are identical; if identical, to represent the first sub-value of the test feature extraction result with the value 1; if not, to represent it with the value 0;

A second judging subunit, configured to judge whether the nickname in the first microblog account information and the nickname in the second microblog account information are identical; if identical, to represent the second sub-value of the test feature extraction result with the value 1; if not, to represent it with the value 0;

A third judging subunit, configured to judge whether the gender in the first microblog account information and the gender in the second microblog account information are identical; if identical, to represent the third sub-value of the test feature extraction result with the value 1; if not, to represent it with the value 0;

A first comparing subunit, configured to compare the age in the first microblog account information with the age in the second microblog account information; if neither age is filled in, to represent the fourth sub-value of the test feature extraction result with the value 0; if the age is filled in in only one of the first microblog account information and the second microblog account information, to represent the fourth sub-value with the value 1; if the two ages are identical, to represent the fourth sub-value with the value 2; if the two ages differ, to represent the fourth sub-value with the value 3;

A fourth judging subunit, configured to judge whether the location in the first microblog account information and the location in the second microblog account information are identical; if identical, to represent the fifth sub-value of the test feature extraction result with the value 1; if not, to represent it with the value 0;

A fifth judging subunit, configured to judge whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, to represent the sixth sub-value of the test feature extraction result with the value 1; if not, to represent it with the value 0;

A first composing subunit, configured to compose the first, second, third, fourth, fifth and sixth sub-values of the test feature extraction result into the test feature extraction result.

A kind of microblog data processing system, including maximum entropy classifiers trainer and micro- as described in above-mentioned any one Rich data processing equipment, wherein, the maximum entropy classifiers trainer includes：

Acquiring unit, for obtaining multiple different positive class samples and multiple different negative class samples, the positive class sample Including two positive account informations, two positive account informations are respectively account letter of the same user in two different microblogging websites Breath, the negative class sample include two negative account letter informations, and two negative account informations belong to different user and its is each self-corresponding Account belongs to different microblogging websites, and two microblogging websites and the negative class sample are corresponding two corresponding to the positive class sample Microblogging website is identical, two microblogging websites corresponding to the sample to be tested, two microblogging websites phase corresponding with the just class sample Together, the positive account information comprises at least：ID, the pet name, sex, age, location and bean vermicelli user are with concern user's Ratio, the negative account information comprise at least：ID, the pet name, sex, age, location and bean vermicelli user and concern user Ratio；

Second feature extracting unit, for carrying out feature to each positive class sample and each negative class sample respectively Extract, obtain corresponding Positive training sample and negative training sample；

6th determining unit, for the number for determining to include in each Positive training sample and each negative training sample Value is characterized value；

3rd computing unit, for according to formulaEach characteristic value is calculated respectively When each y is respectively each+1 and -1, corresponding positive predicted condition probability and negative predicted condition probability, wherein, the y is to appoint One Positive training sample of meaning or any one negative training sample, the x are characterized value, P_λ(y | x) is predicted condition probability, exp () is the exponential function that natural number e is bottom, f_i() is binary feature function, describedλ_iIt is characterized function Value f_iThe weights of different characteristic functional value are identical corresponding to the weights of (x, y) and identical x,For to corresponding to each characteristic value The function that characteristic function value is summed,For to y is different value when the function summed of corresponding data, the λ_i's Known to initial value；

A fourth computing unit, configured to use the GIS (Generalized Iterative Scaling) algorithm to adjust the positive prediction conditional probability corresponding to each feature value until the positive prediction conditional probability of each feature value converges, and to take the λ corresponding to each converged positive prediction conditional probability as the positive optimal weight of the feature function value corresponding to that feature value;

A fifth computing unit, configured to use the GIS algorithm to adjust the negative prediction conditional probability corresponding to each feature value until the negative prediction conditional probability of each feature value converges, and to take the λ corresponding to each converged negative prediction conditional probability as the negative optimal weight of the feature function value corresponding to that feature value.

Preferably, where the positive account information includes at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information includes at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, the second feature extraction unit includes:

A sixth judgment subunit, configured to judge whether the user IDs in the two pieces of positive account information of each positive class sample are the same; if so, the first sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A seventh judgment subunit, configured to judge whether the nicknames in the two pieces of positive account information of each positive class sample are the same; if so, the second sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

An eighth judgment subunit, configured to judge whether the genders in the two pieces of positive account information of each positive class sample are the same; if so, the third sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A second comparison subunit, configured to compare the ages in the two pieces of positive account information of each positive class sample; if neither age is filled in, the fourth sub-value of the positive feature extraction result is represented by the value 0; if only one of the two ages is filled in, by the value 1; if the two ages are the same, by the value 2; and if the two ages differ, by the value 3;

A ninth judgment subunit, configured to judge whether the locations in the two pieces of positive account information of each positive class sample are the same; if so, the fifth sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A tenth judgment subunit, configured to judge whether the ratios of followers to followed users in the two pieces of positive account information of each positive class sample fall within the same ratio range; if so, the sixth sub-value of the positive feature extraction result is represented by the value 1, and if not, by the value 0;

A second composition subunit, configured to combine the first, second, third, fourth, fifth, and sixth sub-values of the positive feature extraction result corresponding to each positive class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive class sample;

An eleventh judgment subunit, configured to judge whether the user IDs in the two pieces of negative account information of each negative class sample are the same; if so, the first sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A twelfth judgment subunit, configured to judge whether the nicknames in the two pieces of negative account information of each negative class sample are the same; if so, the second sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A thirteenth judgment subunit, configured to judge whether the genders in the two pieces of negative account information of each negative class sample are the same; if so, the third sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A third comparison subunit, configured to compare the ages in the two pieces of negative account information of each negative class sample; if neither age is filled in, the fourth sub-value of the negative feature extraction result is represented by the value 0; if only one of the two ages is filled in, by the value 1; if the two ages are the same, by the value 2; and if the two ages differ, by the value 3;

A fourteenth judgment subunit, configured to judge whether the locations in the two pieces of negative account information of each negative class sample are the same; if so, the fifth sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A fifteenth judgment subunit, configured to judge whether the ratios of followers to followed users in the two pieces of negative account information of each negative class sample fall within the same ratio range; if so, the sixth sub-value of the negative feature extraction result is represented by the value 1, and if not, by the value 0;

A third composition subunit, configured to combine the first, second, third, fourth, fifth, and sixth sub-values of the negative feature extraction result corresponding to each negative class sample into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative class sample.

Compared with the prior art, this application has the following beneficial effects:

In this application, a maximum entropy classifier is used to calculate, for each to-be-classified feature value, the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability that correspond to the sample to be tested being preset as the positive class and the negative class. The to-be-classified positive prediction sub-conditional probabilities of all to-be-classified feature values are multiplied together to obtain the to-be-classified positive prediction conditional probability, and the to-be-classified negative prediction sub-conditional probabilities of all to-be-classified feature values are multiplied together to obtain the to-be-classified negative prediction conditional probability. The two conditional probabilities are then compared. If the to-be-classified positive prediction conditional probability is the larger, the class of the sample to be tested is determined to be positive; if the to-be-classified negative prediction conditional probability is the larger, the class of the sample to be tested is determined to be negative. The class of the sample to be tested is thereby predicted with the maximum entropy classifier.

When the class of the sample to be tested is predicted to be positive, the two accounts corresponding to the sample to be tested are determined to belong to the same user; when the class is predicted to be negative, the two accounts are determined not to belong to the same user. The same user is thereby identified across different microblog websites.

Brief description of the drawings

To describe the technical solutions in the embodiments of this application more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a microblog data processing method provided by this application;

Fig. 2 is a flowchart of a training process of a maximum entropy classifier provided by this application;

Fig. 3 is a schematic diagram of the logical structure of a microblog data processing apparatus provided by this application;

Fig. 4 is a schematic diagram of the logical structure of a microblog data processing system provided by this application;

Fig. 5 is a schematic diagram of the logical structure of a maximum entropy classifier training apparatus provided by this application.

Detailed description of the embodiments

The technical solutions in the embodiments of this application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.

Embodiment one

This embodiment describes the microblog data processing method provided by this application. Referring to Fig. 1, which shows a flowchart of the microblog data processing method provided by this application, the method may include the following steps:

Step S11: Perform feature extraction on the sample to be tested, to obtain a to-be-tested feature extraction result value.

The sample to be tested is a pair of information items consisting of first microblog account information and second microblog account information, where the microblog website to which the account corresponding to the first microblog account information belongs differs from the microblog website to which the account corresponding to the second microblog account information belongs. For example, if the first microblog account information is denoted a and the second microblog account information is denoted b, the sample to be tested is (a, b), and the account corresponding to a and the account corresponding to b belong to different microblog websites: for instance, the account corresponding to a belongs to the Sina Weibo website, and the account corresponding to b belongs to the Tencent Weibo website.

Step S12: Determine that each numerical value contained in the to-be-tested feature extraction result value is a to-be-classified feature value.

Step S13: Using a maximum entropy classifier, calculate, for each to-be-classified feature value, the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability that correspond to the sample to be tested being preset as the positive class and the negative class.

Step S14: Multiply together the to-be-classified positive prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified positive prediction conditional probability, and multiply together the to-be-classified negative prediction sub-conditional probabilities of all to-be-classified feature values to obtain the to-be-classified negative prediction conditional probability.

Step S15: Compare the to-be-classified positive prediction conditional probability with the to-be-classified negative prediction conditional probability.

If the comparison shows that the to-be-classified positive prediction conditional probability is the larger, step S16 is performed; if it shows that the to-be-classified negative prediction conditional probability is the larger, step S17 is performed.

Step S16: Determine that the class of the sample to be tested is positive.

Step S17: Determine that the class of the sample to be tested is negative.

Step S18: When the class of the sample to be tested is positive, determine that the two accounts corresponding to the sample to be tested belong to the same user.

Step S19: When the class of the sample to be tested is negative, determine that the two accounts corresponding to the sample to be tested do not belong to the same user.
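Steps S14 to S19 can be sketched in Python as follows. This is only an illustrative sketch: the function names `classify` and `same_user` are hypothetical, the sub-conditional probabilities are assumed to have already been produced by step S13, and the handling of an exact tie (treated here as negative) is an assumption, since the text does not specify it.

```python
from math import prod
from typing import Sequence


def classify(pos_sub_probs: Sequence[float], neg_sub_probs: Sequence[float]) -> int:
    """Return +1 (positive class) or -1 (negative class) for one sample to be tested."""
    # Step S14: multiply the sub-conditional probabilities of all feature values.
    pos_prob = prod(pos_sub_probs)
    neg_prob = prod(neg_sub_probs)
    # Steps S15 to S17: compare and pick the class with the larger probability.
    # Assumption: a tie is treated as negative (not specified in the text).
    return 1 if pos_prob > neg_prob else -1


def same_user(label: int) -> bool:
    """Steps S18 and S19: a positive class means the two accounts share one user."""
    return label == 1
```

For instance, `same_user(classify([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))` evaluates to `True`, because the positive product is the larger of the two.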

Embodiment two

This embodiment describes the detailed process of using the maximum entropy classifier to calculate, for each to-be-classified feature value, the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability that correspond to the sample to be tested being preset as the positive class and the negative class.

The process is specifically as follows:

Using the maximum entropy objective function formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$, calculate, for each to-be-classified feature value, the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability that correspond to y being +1 and -1 respectively, where y is the sample to be tested, x is a to-be-classified feature value, $P_\lambda(y \mid x)$ is the to-be-classified prediction sub-conditional probability, exp() is the exponential function with base e, $f_i()$ is a binary feature function, $\lambda_i$ is the positive optimal weight of the feature function value $f_i(x, y)$ when y is +1 or the negative optimal weight when y is -1, different feature function values corresponding to the same x have the same weight, $\sum_i$ sums the feature function values corresponding to each to-be-classified feature value, and $\sum_y$ sums the corresponding data over the different values of y.

The preset feature values are the numerical values contained in the training samples used when the maximum entropy classifier is trained.

The following example illustrates how, according to the formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$, the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability of each to-be-classified feature value are calculated when y is +1 and -1 respectively.

For example, suppose the to-be-classified feature values are 0, 1, 1, 3, 1, 1 and the preset feature values are 0, 1, 2. Then, when the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability of the value 3 are calculated, λ is 0.

When the to-be-classified positive prediction sub-conditional probability of the value 0 among the to-be-classified feature values is calculated, λ is the positive optimal weight corresponding to 0; when the to-be-classified negative prediction sub-conditional probability of the value 0 is calculated, λ is the negative optimal weight corresponding to 0.

When the to-be-classified positive prediction sub-conditional probability of the value 1 among the to-be-classified feature values is calculated, λ is the positive optimal weight corresponding to 1; when the to-be-classified negative prediction sub-conditional probability of the value 1 is calculated, λ is the negative optimal weight corresponding to 1.

Taking the value 0 among the to-be-classified feature values as an example, the calculation of the to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability when y is +1 and -1 respectively is described below. Let the positive optimal weight corresponding to the value 0 be $\lambda'_1$ and the negative optimal weight be $\lambda'_2$. When y is +1, the feature function values corresponding to the value 0 when the preset class of the sample to be tested is positive and negative are $f_{1}(1, 0)$ and $f_{-1}(1, 0)$ respectively; when y is -1, the feature function values corresponding to the value 0 when the preset class of the sample to be tested is positive and negative are $f_{1}(-1, 0)$ and $f_{-1}(-1, 0)$ respectively.

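When y is +1, substituting into the formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$ yields $P_\lambda(+1 \mid 0)$, that is, the to-be-classified positive prediction sub-conditional probability of the value 0 among the to-be-classified feature values.

When y is -1, substituting into the same formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$ yields $P_\lambda(-1 \mid 0)$, that is, the to-be-classified negative prediction sub-conditional probability of the value 0 among the to-be-classified feature values.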

The to-be-classified positive prediction sub-conditional probability and the to-be-classified negative prediction sub-conditional probability of every other to-be-classified feature value are calculated in the same way as in the above example for the value 0, and the details are not repeated here.
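The per-value calculation in this embodiment can be sketched in Python as follows. This is a deliberately simplified illustration, not the patent's exact computation: it assumes that for each class exactly one binary feature function fires (equals 1) for a given value, so the exponent reduces to the class weight alone. The function name `sub_conditional_probs` and the weight tables passed to it are hypothetical. A value that is not among the preset feature values gets a weight of 0 for both classes, as in the example of the value 3 above.

```python
import math
from typing import Dict, Tuple


def sub_conditional_probs(
    x: int,
    pos_weights: Dict[int, float],
    neg_weights: Dict[int, float],
) -> Tuple[float, float]:
    """Return (positive, negative) prediction sub-conditional probabilities for value x."""
    # A value never seen in training has no learned weight, so lambda is 0.
    lam_pos = pos_weights.get(x, 0.0)
    lam_neg = neg_weights.get(x, 0.0)
    # Assumption: one active binary feature per class, so the exponent is lambda * 1.
    score_pos = math.exp(lam_pos)
    score_neg = math.exp(lam_neg)
    # The denominator sums over both values of y (+1 and -1).
    z = score_pos + score_neg
    return score_pos / z, score_neg / z
```

Under this simplification the two probabilities for any value sum to 1, and an unseen value such as 3 yields 0.5 for each class, so it does not favor either class when the products are formed.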

Embodiment three

In Embodiment one and Embodiment two, the first microblog account information and the second microblog account information may each include at least: user ID (identity), nickname, gender, age, location, and the ratio of followers to followed users. That is, the first microblog account information may include at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, and the second microblog account information may include at least the same items.

In this embodiment, where the first microblog account information and the second microblog account information each include at least user ID, nickname, gender, age, location, and the ratio of followers to followed users, the process of performing feature extraction on the sample to be tested to obtain the to-be-tested feature extraction result value may specifically be as follows:

A11: Judge whether the user ID in the first microblog account information is the same as the user ID in the second microblog account information. If so, the first sub-value of the to-be-tested feature extraction result is represented by the value 1; if not, by the value 0.

A12: Judge whether the nickname in the first microblog account information is the same as the nickname in the second microblog account information. If so, the second sub-value of the to-be-tested feature extraction result is represented by the value 1; if not, by the value 0.

A13: Judge whether the gender in the first microblog account information is the same as the gender in the second microblog account information. If so, the third sub-value of the to-be-tested feature extraction result is represented by the value 1; if not, by the value 0.

A14: Compare the age in the first microblog account information with the age in the second microblog account information. If neither age is filled in, the fourth sub-value of the to-be-tested feature extraction result is represented by the value 0; if only one of the two ages is filled in, by the value 1; if the two ages are the same, by the value 2; and if the two ages differ, by the value 3.

A15: Judge whether the location in the first microblog account information is the same as the location in the second microblog account information. If so, the fifth sub-value of the to-be-tested feature extraction result is represented by the value 1; if not, by the value 0.

A16: Judge whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range. If so, the sixth sub-value of the to-be-tested feature extraction result is represented by the value 1; if not, by the value 0.

In this embodiment, the preset ranges for the ratio of followers to followed users may be divided into: [0, 0.8], (0.8, 1.5), [1.5, 3], and greater than 3.

A17: Combine the first, second, third, fourth, fifth, and sixth sub-values of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.
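Steps A11 to A17 can be sketched in Python as follows. The dictionary keys (`user_id`, `nickname`, `gender`, `age`, `location`, `follower_ratio`) and the function names are hypothetical choices made for this illustration, and an age that has not been filled in is assumed to be stored as `None`.

```python
from typing import Any, Dict, List, Optional


def ratio_bucket(ratio: float) -> int:
    """Index of the preset range: [0, 0.8], (0.8, 1.5), [1.5, 3], greater than 3."""
    if ratio <= 0.8:
        return 0
    if ratio < 1.5:
        return 1
    if ratio <= 3:
        return 2
    return 3


def age_feature(age_a: Optional[int], age_b: Optional[int]) -> int:
    """Step A14: 0 = neither filled in, 1 = only one filled in, 2 = same, 3 = different."""
    if age_a is None and age_b is None:
        return 0
    if age_a is None or age_b is None:
        return 1
    return 2 if age_a == age_b else 3


def extract_features(a: Dict[str, Any], b: Dict[str, Any]) -> List[int]:
    """Steps A11 to A17: build the six sub-values for an account pair (a, b)."""
    return [
        int(a["user_id"] == b["user_id"]),    # A11
        int(a["nickname"] == b["nickname"]),  # A12
        int(a["gender"] == b["gender"]),      # A13
        age_feature(a["age"], b["age"]),      # A14
        int(a["location"] == b["location"]),  # A15
        int(ratio_bucket(a["follower_ratio"]) == ratio_bucket(b["follower_ratio"])),  # A16
    ]
```

The same function could be applied unchanged to the positive and negative class samples during training, since the text describes identical rules for all three kinds of sample.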

Embodiment four

This embodiment describes the training process of the maximum entropy classifier. Referring to Fig. 2, which shows a flowchart of the training process of the maximum entropy classifier provided by this application, the process may include the following steps:

Step S21：Obtain multiple different positive class samples and multiple different negative class samples.

Each positive class sample includes two pieces of positive account information, which are the account information of the same user on two different microblog websites. Each negative class sample includes two pieces of negative account information, which belong to different users and whose corresponding accounts belong to different microblog websites. The two microblog websites corresponding to the positive class samples are the same as the two microblog websites corresponding to the negative class samples, and the two microblog websites corresponding to the sample to be tested are the same as the two microblog websites corresponding to the positive class samples.

Here, the statement that the two pieces of negative account information belong to different users and their corresponding accounts belong to different microblog websites means that the two pieces of negative account information belong to different users, and the accounts corresponding to the two pieces of negative account information belong to different microblog websites.

In this embodiment, the positive class samples and negative class samples may be generated as described in steps B11 and B12 below:

Step B11: Collect the account information of each of a plurality of sampled users on two different microblog websites.

Each sampled user has an account on each of the two different microblog websites. For example, sampled user U1 has a Sina account A on the Sina Weibo website and a Tencent account B on the Tencent Weibo website.

Taking sampled user U1 as an example of collecting the account information of any sampled user on two different microblog websites: if the account information of U1's Sina account A is a and the account information of U1's Tencent account B is b, then account information a and account information b of sampled user U1 are collected.

Because the account information of every sampled user on the two different microblog websites is collected in the same way, this embodiment describes the collection process for only one sampled user, as follows: collect the account information of the sampled user on the first microblog website, and collect the account information of the sampled user on the second microblog website, where the first microblog website and the second microblog website are different microblog websites.

The process of collecting the account information of the sampled user on the first microblog website is:

C11: Build a first user queue.

C12: Add the sampled user to the first user queue.

C13: Take the sampled user out of the first user queue, extract the account information of the sampled user on the first microblog website through the API (Application Programming Interface) provided by the first microblog website, and add the account information of the sampled user on the first microblog website to the first user queue.

When the account information of the sampled user on the first microblog website is needed later, it can be extracted from the first user queue.

The process of collecting the account information of the sampled user on the second microblog website is:

D11: Build a second user queue.

D12: Add the sampled user to the second user queue.

D13: Take the sampled user out of the second user queue, extract the account information of the sampled user on the second microblog website through the API provided by the second microblog website, and add the account information of the sampled user on the second microblog website to the second user queue.

When the account information of the sampled user on the second microblog website is needed later, it can be extracted from the second user queue.
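The queue-based collection in steps C11 to C13 (and likewise D11 to D13) can be sketched in Python as follows. The function name `collect_account_info` and the `fetch_account_info` callback are hypothetical: the callback stands in for whatever API a given microblog website provides, which the text does not describe. Storing each result as a `(user, info)` pair is also an assumption made so that the information can be looked up later.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, Iterable, Tuple


def collect_account_info(
    sampled_users: Iterable[str],
    fetch_account_info: Callable[[str], Dict[str, Any]],
) -> Deque[Tuple[str, Dict[str, Any]]]:
    """Collect the account information of every sampled user on one website."""
    queue: deque = deque()             # C11 / D11: build the user queue
    for user in sampled_users:         # C12 / D12: add the sampled users
        queue.append(user)
    for _ in range(len(queue)):        # C13 / D13: take each user out once
        user = queue.popleft()
        info = fetch_account_info(user)  # hypothetical wrapper around the site API
        queue.append((user, info))     # put the account information back in the queue
    return queue
```

The function would be called once per website, each time with a callback for that website, giving the first and second user queues from which account information is later extracted.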

Step B12: For each sampled user, pair that user's account information on the two different microblog websites to form a positive class sample; from the account information of any two sampled users, take two pieces of account information that do not belong to the same sampled user and that come from different microblog websites to form a pair, as a negative class sample.

Pairing each sampled user's account information on the two different microblog websites as a positive class sample is a manual labeling process.

Pairing two pieces of account information that do not belong to the same sampled user and that come from different microblog websites as a negative class sample is also a manual labeling process.

For example, suppose the account information of sampled user U1 on the two different microblog websites is a and b, and the account information of sampled user U2 on the two different microblog websites is c and d, where the accounts corresponding to a and c belong to the same microblog website, the accounts corresponding to b and d belong to the same microblog website, and the website of a and c differs from the website of b and d. Then (a, b) and (c, d) are positive class samples, and (a, d) and (b, c) are negative class samples.
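The pairing rule of step B12 can be sketched in Python as follows. The text describes this step as manual labeling; the sketch only illustrates which pairs result, under the assumption that the owner of every account is already known. The function name `build_samples` and the input layout (user mapped to a tuple of first-site and second-site account information) are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def build_samples(
    accounts: Dict[str, Tuple[str, str]],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """accounts maps each sampled user to (info on site 1, info on site 2)."""
    # Positive class: the two pieces of account information of the same user.
    positives = [(site1, site2) for site1, site2 in accounts.values()]
    # Negative class: site-1 information of one user paired with
    # site-2 information of a different user.
    negatives = [
        (accounts[u][0], accounts[v][1]) for u, v in permutations(accounts, 2)
    ]
    return positives, negatives
```

With `{"U1": ("a", "b"), "U2": ("c", "d")}` this produces the positive pairs (a, b) and (c, d) and the negative pairs (a, d) and (c, b); the latter is the pair written as (b, c) in the example above, listed here with the first-site account first.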

Step S22: Perform feature extraction on each positive class sample and each negative class sample respectively, to obtain the corresponding positive training samples and negative training samples.

Step S23: Determine that the numerical values contained in each positive training sample and each negative training sample are feature values.

In this embodiment, the numerical values contained in each positive training sample and each negative training sample are the preset feature values referred to in Embodiment two.

Step S24: According to the formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$, calculate the positive prediction conditional probability and the negative prediction conditional probability corresponding to each feature value when each y is +1 and -1 respectively.

Here, y is any positive training sample or any negative training sample, x is a feature value, $P_\lambda(y \mid x)$ is the prediction conditional probability, exp() is the exponential function with base e, $f_i()$ is a binary feature function, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, different feature function values corresponding to the same x have the same weight, $\sum_i$ sums the feature function values corresponding to each feature value, $\sum_y$ sums the corresponding data over the different values of y, and the initial value of $\lambda_i$ is known.

Because the initial value of $\lambda_i$ is known and y is a given value, the positive prediction conditional probability and the negative prediction conditional probability corresponding to each feature value when each y is +1 and -1 respectively can be calculated according to the formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$.

The process of calculating, according to the formula, the positive predicted conditional probability and the negative predicted conditional probability of each feature value when y is +1 and -1 respectively is now described in detail by way of example. Suppose there are two training samples with serial numbers 1 and 2. The training sample with serial number 1 is a positive training sample containing the values 0, 1, 1, 2, 1, 1, where the first value 0 corresponds to the ID. The training sample with serial number 2 is a negative training sample containing the values 0, 0, 0, 1, 0, 0.

Taking the first value 0 (the value corresponding to the ID) as an example, the process of calculating its positive predicted conditional probability and negative predicted conditional probability when y is +1 and -1 respectively is as follows.

The value 0 (the value corresponding to the ID) appears in both the positive training sample and the negative training sample. Therefore, when y is +1, the value 0 corresponds to two feature function values, f₁(1,0) and f₂(1,0), and when y is -1, it corresponds to two feature function values, f₁(-1,0) and f₂(-1,0) (the arguments are written here with the label y first). Since the weights of the different feature function values corresponding to the same x are identical, the weight of f₁(1,0) and the weight of f₂(1,0) for the value 0 are identical and are denoted λ₁; likewise, the weight of f₁(-1,0) and the weight of f₂(-1,0) for the value 0 are identical and are denoted λ₂.

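When y is +1, the formula $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\big(\sum_i \lambda_i f_i(x, y)\big)$ gives

$$P_\lambda(1 \mid 0) = \frac{\exp\big(\lambda_1 f_1(1,0) + \lambda_1 f_2(1,0)\big)}{\exp\big(\lambda_1 f_1(1,0) + \lambda_1 f_2(1,0)\big) + \exp\big(\lambda_2 f_1(-1,0) + \lambda_2 f_2(-1,0)\big)},$$

which is the positive predicted conditional probability of 0 (the value corresponding to the ID).

When y is -1, the same formula gives

$$P_\lambda(-1 \mid 0) = \frac{\exp\big(\lambda_2 f_1(-1,0) + \lambda_2 f_2(-1,0)\big)}{\exp\big(\lambda_1 f_1(1,0) + \lambda_1 f_2(1,0)\big) + \exp\big(\lambda_2 f_1(-1,0) + \lambda_2 f_2(-1,0)\big)},$$

which is the negative predicted conditional probability of 0 (the value corresponding to the ID).

Since the initial value of λ_i is known, the values of λ₁ and λ₂ are known, and P_λ(1 | 0) and P_λ(-1 | 0) can therefore be calculated.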
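The calculation for the value 0 can be reproduced numerically with a short Python sketch. The weights `lam1` and `lam2` and the feature function values are hypothetical placeholders chosen for illustration only; the text states that the initial weights are known but does not give them.

```python
import math

def conditional_probs(scores):
    """Normalise exp(score) over the labels.

    `scores` maps each label y to sum_i lambda_i * f_i(x, y).
    """
    z = sum(math.exp(s) for s in scores.values())  # Z_lambda(x)
    return {y: math.exp(s) / z for y, s in scores.items()}

# Hypothetical initial weights shared by the two feature functions of x = 0.
lam1, lam2 = 0.5, 0.2
# Assumption: both binary feature functions take the value 1 for x = 0.
f_pos = [1, 1]   # f1(1, 0), f2(1, 0)
f_neg = [1, 1]   # f1(-1, 0), f2(-1, 0)

scores = {
    +1: lam1 * sum(f_pos),
    -1: lam2 * sum(f_neg),
}
probs = conditional_probs(scores)
print(probs[+1], probs[-1])
```

The two printed values are the positive and negative predicted conditional probabilities of the value 0 under these assumed weights; they sum to 1 by construction.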

The positive predicted conditional probabilities and negative predicted conditional probabilities of the values 1, 1, 2, 1, 1 contained in the positive training sample are calculated in the same way as for 0 (the value corresponding to the ID) above, and the calculation is not repeated here.

Likewise, the positive predicted conditional probability and negative predicted conditional probability of every other feature value are calculated in the same way as in the above example for 0 (the value corresponding to the ID), and the calculation is not repeated here.

Step S25: Using the GIS (generalized iterative scaling) algorithm, adjust the positive predicted conditional probability corresponding to each feature value until the positive predicted conditional probability of each feature value converges, and take the λ corresponding to the converged positive predicted conditional probability of each feature value as the positive optimal weight of the feature function values corresponding to that feature value.

The principle by which the GIS algorithm adjusts the positive predicted conditional probability corresponding to each feature value until it converges is an existing principle and is not described here.

In this embodiment, the positive predicted conditional probability of a feature value is considered to have converged when it reaches its maximum.

Step S26: Using the GIS algorithm, adjust the negative predicted conditional probability corresponding to each feature value until the negative predicted conditional probability of each feature value converges, and take the λ corresponding to the converged negative predicted conditional probability of each feature value as the negative optimal weight of the feature function values corresponding to that feature value.

The principle by which the GIS algorithm adjusts the negative predicted conditional probability corresponding to each feature value until it converges is an existing principle and is not described here.

In this embodiment, the negative predicted conditional probability of a feature value is considered to have converged when it reaches its maximum.
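Because the GIS procedure is referred to only as an existing principle, the following Python sketch shows one common textbook form of generalized iterative scaling, not necessarily the exact variant intended here. Several choices are assumptions made for illustration: one indicator feature per (position, value, label) triple seen in training, a fixed iteration count in place of a convergence test, omission of the usual correction (slack) feature, and a single loop that updates the weights for both labels rather than the two separate steps S25 and S26.

```python
import math
from collections import defaultdict

LABELS = (+1, -1)

def active(x, y):
    """Indicator features switched on by feature vector x under label y."""
    return [(k, v, y) for k, v in enumerate(x)]

def predict_proba(weights, x):
    """P(y | x) = exp(sum of active weights) / Z(x)."""
    scores = {y: sum(weights.get(f, 0.0) for f in active(x, y)) for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

def train_gis(samples, iterations=50):
    """GIS update: w += (1/C) * log(empirical / expected)."""
    n = len(samples)
    c = len(samples[0][0])  # at most C features are active per (x, y)
    empirical = defaultdict(float)
    for x, y in samples:
        for f in active(x, y):
            empirical[f] += 1.0 / n
    weights = {f: 0.0 for f in empirical}  # assumed initial value 0
    for _ in range(iterations):
        expected = defaultdict(float)
        for x, _label in samples:
            probs = predict_proba(weights, x)
            for y in LABELS:
                for f in active(x, y):
                    if f in weights:
                        expected[f] += probs[y] / n
        for f in weights:
            weights[f] += math.log(empirical[f] / expected[f]) / c
    return weights

# The two training samples from the worked example above.
samples = [((0, 1, 1, 2, 1, 1), +1), ((0, 0, 0, 1, 0, 0), -1)]
weights = train_gis(samples)
```

After training, `predict_proba(weights, x)` assigns the larger probability to the label each training sample was given; the weights attached to label +1 play the role of the positive optimal weights and those attached to label -1 the role of the negative optimal weights.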

The maximum entropy classifier obtained after training in steps S21 to S26 can be used to calculate, for each to-be-classified feature value, the to-be-classified positive predicted sub-conditional probability and the to-be-classified negative predicted sub-conditional probability obtained when the sample under test is preset as the positive class and as the negative class respectively. The detailed process is as shown in Embodiment Two.

In this embodiment, the positive account information may include at least: ID, nickname, gender, age, location, and the ratio of followers to followed users. The negative account information may include at least: ID, nickname, gender, age, location, and the ratio of followers to followed users.

In the case where the positive account information includes at least ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information includes at least ID, nickname, gender, age, location, and the ratio of followers to followed users, the process of performing feature extraction on each positive-class sample and each negative-class sample respectively to obtain the corresponding positive training samples and negative training samples is specifically:

E11: Judge whether the IDs in the two positive account information items of each positive-class sample are identical. If they are identical, represent the first sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E12: Judge whether the nicknames in the two positive account information items of each positive-class sample are identical. If they are identical, represent the second sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E13: Judge whether the genders in the two positive account information items of each positive-class sample are identical. If they are identical, represent the third sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E14: Compare the ages in the two positive account information items of each positive-class sample. If the age is not filled in for either item, represent the fourth sub-value of the positive feature extraction result with the value 0; if the age is filled in for only one of the two items, represent it with the value 1; if the ages in the two items are identical, represent it with the value 2; if the ages in the two items differ, represent it with the value 3.

E15: Judge whether the locations in the two positive account information items of each positive-class sample are identical. If they are identical, represent the fifth sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E16: Judge whether the ratios of followers to followed users in the two positive account information items of each positive-class sample fall within the same ratio range. If so, represent the sixth sub-value of the positive feature extraction result with the value 1; if not, represent it with the value 0.

E17: Combine the first, second, third, fourth, fifth, and sixth sub-values of the positive feature extraction result corresponding to each positive-class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample.

E18: Judge whether the IDs in the two negative account information items of each negative-class sample are identical. If they are identical, represent the first sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E19: Judge whether the nicknames in the two negative account information items of each negative-class sample are identical. If they are identical, represent the second sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E110: Judge whether the genders in the two negative account information items of each negative-class sample are identical. If they are identical, represent the third sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E111: Compare the ages in the two negative account information items of each negative-class sample. If the age is not filled in for either item, represent the fourth sub-value of the negative feature extraction result with the value 0; if the age is filled in for only one of the two items, represent it with the value 1; if the ages in the two items are identical, represent it with the value 2; if the ages in the two items differ, represent it with the value 3.

E112: Judge whether the locations in the two negative account information items of each negative-class sample are identical. If they are identical, represent the fifth sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E113: Judge whether the ratios of followers to followed users in the two negative account information items of each negative-class sample fall within the same ratio range. If so, represent the sixth sub-value of the negative feature extraction result with the value 1; if not, represent it with the value 0.

E114: Combine the first, second, third, fourth, fifth, and sixth sub-values of the negative feature extraction result corresponding to each negative-class sample into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative-class sample.

In this embodiment, steps E11 to E17 are now illustrated by way of example. Suppose the positive account information items a and b of user U1 on two different microblog websites form the positive-class sample (a, b). With reference to Table 1, the following explains how feature extraction is performed on the positive-class sample (a, b) to obtain a positive training sample.

Table 1

As shown in Table 1, the first sub-value of the positive feature extraction result is 0, the second sub-value is 1, the third sub-value is 1, the fourth sub-value is 2, the fifth sub-value is 1, and the sixth sub-value is 1. The positive feature extraction result value is therefore a row of values, namely {0, 1, 1, 2, 1, 1}.
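The extraction rules E11 to E17 (and, identically, E18 to E114) can be sketched as a small Python function. This is a minimal illustration under stated assumptions: the field names, the two example accounts, and the ratio bucket boundaries in `ratio_bucket` are all hypothetical, since neither the account contents nor the ratio ranges are specified in the text. An unfilled age is represented by `None`.

```python
def same(a, b):
    """Return 1 if the two field values are identical, else 0."""
    return 1 if a == b else 0

def age_feature(age_a, age_b):
    """Four-valued age rule: 0 neither filled, 1 one filled, 2 equal, 3 different."""
    if age_a is None and age_b is None:
        return 0
    if age_a is None or age_b is None:
        return 1
    return 2 if age_a == age_b else 3

def ratio_bucket(ratio, bounds=(0.5, 1.0, 2.0, 10.0)):
    """Index of the ratio range; the boundaries are hypothetical."""
    return sum(ratio >= b for b in bounds)

def extract_features(acc_a, acc_b):
    """Turn a pair of account records into the six-value feature vector."""
    return [
        same(acc_a["id"], acc_b["id"]),
        same(acc_a["nickname"], acc_b["nickname"]),
        same(acc_a["gender"], acc_b["gender"]),
        age_feature(acc_a.get("age"), acc_b.get("age")),
        same(acc_a["location"], acc_b["location"]),
        same(ratio_bucket(acc_a["ratio"]), ratio_bucket(acc_b["ratio"])),
    ]

# Hypothetical accounts chosen so that the result matches {0, 1, 1, 2, 1, 1}.
a = {"id": "u_1001", "nickname": "xiaoming", "gender": "M",
     "age": 25, "location": "Suzhou", "ratio": 1.2}
b = {"id": "ming_88", "nickname": "xiaoming", "gender": "M",
     "age": 25, "location": "Suzhou", "ratio": 1.5}
features = extract_features(a, b)
```

Only the age feature is four-valued; the other five are binary, which is why the vector above contains a single 2.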

In the above embodiments, the microblog website to which the account corresponding to the first microblog account information belongs may be, but is not limited to, the Sina Weibo website, and the microblog website to which the account corresponding to the second microblog account information belongs may be, but is not limited to, the Tencent Weibo website.

Embodiment five

Corresponding to the above method embodiments, this embodiment provides a microblog data processing apparatus. Referring to Fig. 3, which shows a schematic diagram of the logical structure of the microblog data processing apparatus provided by this application, the microblog data processing apparatus includes: a first feature extraction unit 31, a first determining unit 32, a first computing unit 33, a second computing unit 34, a comparing unit 35, a second determining unit 36, a third determining unit 37, a fourth determining unit 38, and a fifth determining unit 39.

The first feature extraction unit 31 is configured to perform feature extraction on a sample under test to obtain a to-be-tested feature extraction result value, where the sample under test is a pair of information items formed by first microblog account information and second microblog account information, and the microblog website to which the account corresponding to the first microblog account information belongs differs from the microblog website to which the account corresponding to the second microblog account information belongs.

The first determining unit 32 is configured to determine that each numerical value contained in the to-be-tested feature extraction result value is a to-be-classified feature value.

The first computing unit 33 is configured to use a maximum entropy classifier to calculate, for each to-be-classified feature value, the to-be-classified positive predicted sub-conditional probability and the to-be-classified negative predicted sub-conditional probability obtained when the sample under test is preset as the positive class and as the negative class respectively.

The second computing unit 34 is configured to multiply together the to-be-classified positive predicted sub-conditional probabilities corresponding to the to-be-classified feature values to obtain the to-be-classified positive predicted conditional probability, and to multiply together the to-be-classified negative predicted sub-conditional probabilities corresponding to the to-be-classified feature values to obtain the to-be-classified negative predicted conditional probability.

The comparing unit 35 is configured to compare the to-be-classified positive predicted conditional probability with the to-be-classified negative predicted conditional probability. When the comparison shows that the to-be-classified positive predicted conditional probability is the larger, it triggers the second determining unit 36 to determine that the class of the sample under test is positive; when the comparison shows that the to-be-classified negative predicted conditional probability is the larger, it triggers the third determining unit 37 to determine that the class of the sample under test is negative.

The fourth determining unit 38 is configured to determine, when the class of the sample under test is positive, that the two accounts corresponding to the sample under test belong to the same user.

The fifth determining unit 39 is configured to determine, when the class of the sample under test is negative, that the two accounts corresponding to the sample under test do not belong to the same user.
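The work of the second computing unit 34, the comparing unit 35, and the determining units 36 to 39 can be sketched as a single Python function. The function name `classify` and the input layout are hypothetical, and the handling of a tie (equal products) is an assumption, since the text does not say which class is chosen in that case.

```python
import math

def classify(sub_probs):
    """Decide whether two accounts belong to the same user.

    `sub_probs` is a list with one (positive, negative) pair of
    sub-conditional probabilities per to-be-classified feature value.
    """
    # Second computing unit: multiply the sub-conditional probabilities.
    p_pos = math.prod(p for p, _ in sub_probs)
    p_neg = math.prod(q for _, q in sub_probs)
    # Comparing unit: the larger product decides the class.
    # Assumption: a tie is treated as the negative class.
    label = +1 if p_pos > p_neg else -1
    # Fourth / fifth determining units: map the class to the decision.
    same_user = label == +1
    return label, same_user, p_pos, p_neg

# Hypothetical sub-conditional probabilities for two feature values.
label, same_user, p_pos, p_neg = classify([(0.6, 0.4), (0.7, 0.3)])
```

For long feature vectors, summing logarithms instead of multiplying probabilities would avoid numerical underflow; the product form is kept here because it mirrors the description of unit 34.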

In this embodiment, the first computing unit 33 specifically includes a computing sub-unit, configured to use the maximum entropy objective function formula

$$P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

to calculate, for each to-be-classified feature value, the to-be-classified positive predicted sub-conditional probability and the to-be-classified negative predicted sub-conditional probability obtained when y is +1 and -1 respectively. Here y denotes the sample under test (taking the value +1 or -1 according to its preset class), x is a to-be-classified feature value, $P_\lambda(y \mid x)$ is the to-be-classified predicted sub-conditional probability, exp(·) is the exponential function with base e, $f_i(\cdot)$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\big(\sum_i \lambda_i f_i(x, y)\big)$, $\lambda_i$ is the positive optimal weight of the feature function value $f_i(x, y)$ when y is +1 or its negative optimal weight when y is -1, the weights of the different feature function values corresponding to the same x are identical, $\sum_i$ sums over the feature function values corresponding to each to-be-classified feature value, and $\sum_y$ sums over the data obtained for the different values of y.

In the above apparatus, in the case where the first microblog account information and the second microblog account information each include at least a user identity number (ID), nickname, gender, age, location, and the ratio of followers to followed users, the first feature extraction unit 31 specifically includes:

A first judging sub-unit, configured to judge whether the ID in the first microblog account information and the ID in the second microblog account information are identical; if they are identical, the first sub-value of the to-be-tested feature extraction result is represented with the value 1, and if not, with the value 0.

A second judging sub-unit, configured to judge whether the nickname in the first microblog account information and the nickname in the second microblog account information are identical; if they are identical, the second sub-value of the to-be-tested feature extraction result is represented with the value 1, and if not, with the value 0.

A third judging sub-unit, configured to judge whether the gender in the first microblog account information and the gender in the second microblog account information are identical; if they are identical, the third sub-value of the to-be-tested feature extraction result is represented with the value 1, and if not, with the value 0.

A first comparing sub-unit, configured to compare the age in the first microblog account information with the age in the second microblog account information. If the age is filled in for neither the first nor the second microblog account information, the fourth sub-value of the to-be-tested feature extraction result is represented with the value 0; if the age is filled in for only one of the two, with the value 1; if the two ages are identical, with the value 2; if the two ages differ, with the value 3.

A fourth judging sub-unit, configured to judge whether the location in the first microblog account information and the location in the second microblog account information are identical; if they are identical, the fifth sub-value of the to-be-tested feature extraction result is represented with the value 1, and if not, with the value 0.

A fifth judging sub-unit, configured to judge whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, the sixth sub-value of the to-be-tested feature extraction result is represented with the value 1, and if not, with the value 0.

Embodiment six

This embodiment provides a microblog data processing system. Referring to Fig. 4, which shows a schematic diagram of the logical structure of the microblog data processing system provided by this application, the microblog data processing system includes a maximum entropy classifier training apparatus 41 and a microblog data processing apparatus 42.

The specific structure of the microblog data processing apparatus 42 is the same as that of the microblog data processing apparatus shown in Embodiment Five and is not described again here.

In this embodiment, the specific structure of the maximum entropy classifier training apparatus 41 is shown in Fig. 5, which is a schematic diagram of the logical structure of the maximum entropy classifier training apparatus provided by this application. The maximum entropy classifier training apparatus includes: an acquiring unit 51, a second feature extraction unit 52, a sixth determining unit 53, a third computing unit 54, a fourth computing unit 55, and a fifth computing unit 56.

The acquiring unit 51 is configured to obtain multiple different positive-class samples and multiple different negative-class samples.

Each positive-class sample includes two positive account information items, which are the account information of the same user on two different microblog websites. Each negative-class sample includes two negative account information items, which belong to different users and whose corresponding accounts belong to different microblog websites. The two microblog websites corresponding to the positive-class samples are the same as the two microblog websites corresponding to the negative-class samples, and the two microblog websites corresponding to the sample under test are the same as the two microblog websites corresponding to the positive-class samples. The positive account information includes at least: ID, nickname, gender, age, location, and the ratio of followers to followed users. The negative account information includes at least: ID, nickname, gender, age, location, and the ratio of followers to followed users.

The second feature extraction unit 52 is configured to perform feature extraction on each positive-class sample and each negative-class sample respectively, to obtain the corresponding positive training samples and negative training samples.

The sixth determining unit 53 is configured to determine that the numerical values contained in each positive training sample and each negative training sample are feature values.

The third computing unit 54 is configured to calculate, according to the formula

$$P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),$$

the positive predicted conditional probability and the negative predicted conditional probability of each feature value when y is +1 and -1 respectively. Here y denotes any positive training sample or any negative training sample (taking the value +1 or -1 respectively), x is a feature value, $P_\lambda(y \mid x)$ is the predicted conditional probability, exp(·) is the exponential function with base e, $f_i(\cdot)$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\big(\sum_i \lambda_i f_i(x, y)\big)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, the weights of the different feature function values corresponding to the same x are identical, $\sum_i$ sums over the feature function values corresponding to each feature value, $\sum_y$ sums over the data obtained for the different values of y, and the initial value of $\lambda_i$ is known.

The fourth computing unit 55 is configured to use the GIS algorithm to adjust the positive predicted conditional probability corresponding to each feature value until the positive predicted conditional probability of each feature value converges, and to take the λ corresponding to the converged positive predicted conditional probability of each feature value as the positive optimal weight of the feature function values corresponding to that feature value.

The fifth computing unit 56 is configured to use the GIS algorithm to adjust the negative predicted conditional probability corresponding to each feature value until the negative predicted conditional probability of each feature value converges, and to take the λ corresponding to the converged negative predicted conditional probability of each feature value as the negative optimal weight of the feature function values corresponding to that feature value.

In this embodiment, in the case where the positive account information includes at least ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information includes at least ID, nickname, gender, age, location, and the ratio of followers to followed users, the second feature extraction unit 52 specifically includes:

A sixth judging sub-unit, configured to judge whether the IDs in the two positive account information items of each positive-class sample are identical; if they are identical, the first sub-value of the positive feature extraction result is represented with the value 1, and if not, with the value 0.

A seventh judging sub-unit, configured to judge whether the nicknames in the two positive account information items of each positive-class sample are identical; if they are identical, the second sub-value of the positive feature extraction result is represented with the value 1, and if not, with the value 0.

An eighth judging sub-unit, configured to judge whether the genders in the two positive account information items of each positive-class sample are identical; if they are identical, the third sub-value of the positive feature extraction result is represented with the value 1, and if not, with the value 0.

A second comparing sub-unit, configured to compare the ages in the two positive account information items of each positive-class sample. If the age is not filled in for either item, the fourth sub-value of the positive feature extraction result is represented with the value 0; if the age is filled in for only one of the two items, with the value 1; if the two ages are identical, with the value 2; if the two ages differ, with the value 3.

A ninth judging sub-unit, configured to judge whether the locations in the two positive account information items of each positive-class sample are identical; if they are identical, the fifth sub-value of the positive feature extraction result is represented with the value 1, and if not, with the value 0.

A tenth judging sub-unit, configured to judge whether the ratios of followers to followed users in the two positive account information items of each positive-class sample fall within the same ratio range; if so, the sixth sub-value of the positive feature extraction result is represented with the value 1, and if not, with the value 0.

A second composing sub-unit, configured to combine the first, second, third, fourth, fifth, and sixth sub-values of the positive feature extraction result corresponding to each positive-class sample into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample.

11st judgment sub-unit, for judging that the ID in the respective two negative account informations of each negative class sample is It is no identical, if identical, represent that negative feature extracts the subvalue of result first with numerical value 1, if differing, negative spy is represented with numerical value 0 Sign extracts the subvalue of result first.

12nd judgment sub-unit, for whether judging the pet name in the respective two negative account informations of each negative class sample It is identical, if identical, represent that negative feature extracts the subvalue of result second with numerical value 1, if differing, negative feature is represented with numerical value 0 Extract the subvalue of result second.

13rd judgment sub-unit, for whether judging the sex in the respective two negative account informations of each negative class sample It is identical, if identical, represent that negative feature extracts the subvalue of result the 3rd with numerical value 1, if differing, negative feature is represented with numerical value 0 Extract the subvalue of result the 3rd.

3rd comparing subunit, for the age in more each negative respective two negative account informations of class sample, if two Age in individual negative account information does not fill in, represents that negative feature extracts the subvalue of result the 4th with numerical value 0, if two negative account letters The age only having in breath in a negative account information has filled in, then represents that negative feature extracts the subvalue of result the 4th with numerical value 1, if two Age in individual negative account information is identical, then represents that negative feature extracts the subvalue of result the 4th with numerical value 2, if two negative account informations In age differ, then with numerical value 3 represent negative feature extract the subvalue of result the 4th.

A fourteenth judging sub-unit, configured to judge whether the locations in the two negative account information items of each negative-class sample are identical; if they are identical, the fifth sub-value of the negative feature extraction result is represented by the value 1, and if they differ, it is represented by the value 0.

It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be cross-referenced. Because the apparatus embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.

Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

For convenience of description, the above apparatus is described as divided into various units by function. Of course, when implementing the present application, the functions of the units may be realized in one or more pieces of software and/or hardware.

From the above description of the embodiments, those skilled in the art will clearly understand that the present application may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as a ROM/RAM, magnetic disk or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts of the embodiments.

The microblog data processing method, apparatus and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A microblog data processing method, characterized by comprising:

Performing feature extraction on a sample to be tested to obtain a to-be-tested feature extraction result value, wherein the sample to be tested is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog website to which the account corresponding to the first microblog account information belongs is different from the microblog website to which the account corresponding to the second microblog account information belongs;

Determining each numerical value contained in the to-be-tested feature extraction result value as a feature value to be classified;

Using a maximum entropy classifier, calculating, for each feature value to be classified, the corresponding to-be-classified positive predicted sub-conditional probability and to-be-classified negative predicted sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class, respectively;

Multiplying together the to-be-classified positive predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a to-be-classified positive predicted conditional probability, and multiplying together the to-be-classified negative predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a to-be-classified negative predicted conditional probability;

Comparing the magnitudes of the to-be-classified positive predicted conditional probability and the to-be-classified negative predicted conditional probability;

When the comparison result is that the to-be-classified positive predicted conditional probability is the larger, determining that the class of the sample to be tested is positive;

When the comparison result is that the to-be-classified negative predicted conditional probability is the larger, determining that the class of the sample to be tested is negative;

When the class of the sample to be tested is positive, determining that the two accounts corresponding to the sample to be tested belong to the same user;

When the class of the sample to be tested is negative, determining that the two accounts corresponding to the sample to be tested do not belong to the same user.
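The multiply-and-compare steps of claim 1 can be illustrated with a minimal Python sketch. The function name `classify_pair` and the handling of an exact tie (treated here as negative) are assumptions made for illustration only; the claim does not specify either.

```python
from math import prod


def classify_pair(pos_sub_probs, neg_sub_probs):
    """Return +1 (same user) or -1 (different users) for one account pair.

    pos_sub_probs / neg_sub_probs hold one sub-conditional probability per
    feature value to be classified, for the positive and the negative preset
    class respectively.
    """
    p_pos = prod(pos_sub_probs)  # to-be-classified positive predicted conditional probability
    p_neg = prod(neg_sub_probs)  # to-be-classified negative predicted conditional probability
    # Assumption: a tie is resolved as negative; the claim leaves this open.
    return 1 if p_pos > p_neg else -1
```

With many features, summing logarithms instead of multiplying probabilities would avoid numerical underflow while giving the same comparison result.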
2. The method according to claim 1, characterized in that the process of using a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive predicted sub-conditional probability and to-be-classified negative predicted sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class comprises:

Using the maximum entropy objective function formula $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$ to calculate, for each feature value to be classified, the corresponding to-be-classified positive predicted sub-conditional probability and to-be-classified negative predicted sub-conditional probability when y is +1 and -1, respectively, wherein y is the sample to be tested, x is the feature value to be classified, $P_\lambda(y \mid x)$ is the to-be-classified predicted sub-conditional probability, exp() is the exponential function with the natural constant e as its base, $f_i()$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, namely the positive optimal weight when y is +1 or the negative optimal weight when y is -1, the different feature function values corresponding to the same x have the same weight, $\sum_i$ is the function that sums over the feature function values corresponding to each feature value to be classified, and $\sum_y$ is the function that sums the corresponding data over the different values of y;

Wherein y being +1 indicates that the sample to be tested is preset as the positive class, and y being -1 indicates that the sample to be tested is preset as the negative class; the feature function values corresponding to each feature value to be classified correspond respectively to the positive and negative preset classes of the sample to be tested; when calculating the to-be-classified positive predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the positive optimal weight corresponding to that feature value to be classified, and otherwise λ is 0; when calculating the to-be-classified negative predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the negative optimal weight corresponding to that feature value to be classified, and otherwise λ is 0.
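As a hedged illustration of the calculation in claim 2, the sketch below evaluates the maximum entropy formula for a single feature value under the assumption that exactly one binary feature function fires per class, so the exponent reduces to the positive or negative optimal weight. The function name and the dictionary-based weight tables are hypothetical; feature values absent from a table receive λ = 0, as the claim states.

```python
import math


def sub_conditional_probs(x, pos_weights, neg_weights):
    """Return (P(+1 | x), P(-1 | x)) for one feature value x.

    pos_weights / neg_weights map preset feature values to their positive /
    negative optimal weights (hypothetical lookup tables).
    """
    lam_pos = pos_weights.get(x, 0.0)  # lambda is 0 when x is not a preset feature value
    lam_neg = neg_weights.get(x, 0.0)
    z = math.exp(lam_pos) + math.exp(lam_neg)  # normaliser Z(x): sum over y in {+1, -1}
    return math.exp(lam_pos) / z, math.exp(lam_neg) / z
```

Here a feature value is keyed as a `(position, value)` tuple in the usage below, which is one possible encoding and not taken from the source.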
3. The method according to claim 1 or 2, characterized in that the first microblog account information and the second microblog account information each comprise at least:

A user identification number (ID), a nickname, a gender, an age, a location, and the ratio of followers to followed users.
4. The method according to claim 3, characterized in that the process of performing feature extraction on the sample to be tested to obtain the to-be-tested feature extraction result value comprises:

Judging whether the ID in the first microblog account information and the ID in the second microblog account information are identical; if they are identical, representing the first sub-value of the to-be-tested feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the nickname in the first microblog account information and the nickname in the second microblog account information are identical; if they are identical, representing the second sub-value of the to-be-tested feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the gender in the first microblog account information and the gender in the second microblog account information are identical; if they are identical, representing the third sub-value of the to-be-tested feature extraction result by the value 1, and if they differ, representing it by the value 0;

Comparing the age in the first microblog account information with the age in the second microblog account information; if the age is filled in in neither the first nor the second microblog account information, representing the fourth sub-value of the to-be-tested feature extraction result by the value 0; if the age is filled in in only one of the first and second microblog account information, representing it by the value 1; if the age in the first microblog account information is identical to the age in the second microblog account information, representing it by the value 2; and if the two ages differ, representing it by the value 3;

Judging whether the location in the first microblog account information and the location in the second microblog account information are identical; if they are identical, representing the fifth sub-value of the to-be-tested feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, representing the sixth sub-value of the to-be-tested feature extraction result by the value 1, and if not, representing it by the value 0;

Combining the first sub-value, the second sub-value, the third sub-value, the fourth sub-value, the fifth sub-value and the sixth sub-value of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.
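The six comparisons of claim 4 can be sketched as follows. The `Account` record, its field names, and in particular the ratio range boundaries in `ratio_bucket` are hypothetical: the claim only requires that the two ratios fall within "the same ratio range" and does not give the ranges here. An unfilled age is modelled as `None`.

```python
from typing import NamedTuple, Optional


class Account(NamedTuple):
    user_id: str
    nickname: str
    gender: str
    age: Optional[int]  # None means the age was not filled in
    location: str
    followers: int
    following: int


def ratio_bucket(acc, edges=(0.5, 1.0, 2.0, 10.0)):
    """Index of the range containing followers/following (edges are assumed)."""
    ratio = acc.followers / acc.following if acc.following else float("inf")
    return sum(ratio >= e for e in edges)


def age_feature(a, b):
    """Four-valued age comparison: 0 neither filled, 1 one filled, 2 equal, 3 different."""
    if a is None and b is None:
        return 0
    if a is None or b is None:
        return 1
    return 2 if a == b else 3


def extract_features(a, b):
    """Return the six sub-values for an account pair as a tuple."""
    return (
        int(a.user_id == b.user_id),
        int(a.nickname == b.nickname),
        int(a.gender == b.gender),
        age_feature(a.age, b.age),
        int(a.location == b.location),
        int(ratio_bucket(a) == ratio_bucket(b)),
    )
```

The same function would apply unchanged to the positive and negative account pairs used for training, since those steps perform the identical comparisons.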
5. The method according to claim 1, characterized in that the training process of the maximum entropy classifier comprises:

Obtaining a plurality of different positive-class samples and a plurality of different negative-class samples, wherein each positive-class sample comprises two positive account information items, which are respectively the account information of the same user on two different microblog websites; each negative-class sample comprises two negative account information items, which belong to different users and whose corresponding accounts belong to different microblog websites; the two microblog websites corresponding to the positive-class samples are the same as the two microblog websites corresponding to the negative-class samples; and the two microblog websites corresponding to the sample to be tested are the same as the two microblog websites corresponding to the positive-class samples;

Performing feature extraction on each positive-class sample and each negative-class sample respectively to obtain the corresponding positive training samples and negative training samples;

Determining the numerical values contained in each positive training sample and each negative training sample as feature values;

According to the formula $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, calculating, for each feature value, the corresponding positive predicted conditional probability and negative predicted conditional probability when each y is +1 and -1, respectively;

Wherein y is any one positive training sample or any one negative training sample, x is a feature value, $P_\lambda(y \mid x)$ is the predicted conditional probability, exp() is the exponential function with the natural constant e as its base, $f_i()$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, the different feature function values corresponding to the same x have the same weight, $\sum_i$ is the function that sums over the feature function values corresponding to each feature value, $\sum_y$ is the function that sums the corresponding data over the different values of y, and the initial value of $\lambda_i$ is known;

Using the GIS (generalized iterative scaling) algorithm, adjusting the positive predicted conditional probability corresponding to each feature value until the positive predicted conditional probability of each feature value converges, and taking the λ corresponding to the converged positive predicted conditional probability of each feature value as the positive optimal weight of the feature function value corresponding to that feature value;

Using the GIS algorithm, adjusting the negative predicted conditional probability corresponding to each feature value until the negative predicted conditional probability of each feature value converges, and taking the λ corresponding to the converged negative predicted conditional probability of each feature value as the negative optimal weight of the feature function value corresponding to that feature value.
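The following is a minimal sketch of generalized iterative scaling for a two-class maximum entropy model, offered as one possible reading of the training steps above and not as the patented procedure itself. It assumes one binary feature function per (position, value, class) triple, a fixed iteration count in place of a convergence test, zero initial weights, and joint training of both classes in a single loop, whereas the claim describes the positive and negative weights being adjusted separately. Triples never observed in the training data are left at weight 0 and no correction feature is used. All names are hypothetical.

```python
import math
from collections import defaultdict

CLASSES = (1, -1)


def class_probs(x, weights):
    """P(y | x) for y in CLASSES under the current weights."""
    scores = {c: math.exp(sum(weights.get((j, v, c), 0.0) for j, v in enumerate(x)))
              for c in CLASSES}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


def train_gis(samples, labels, iterations=100):
    """Return weights keyed by (position, value, class) after GIS updates."""
    c_const = len(samples[0])  # GIS constant: at most this many features fire per (x, y)
    empirical = defaultdict(float)
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            empirical[(j, v, y)] += 1.0  # observed feature counts
    weights = {k: 0.0 for k in empirical}  # assumed initial value of every lambda
    for _ in range(iterations):
        expected = defaultdict(float)
        for x in samples:
            probs = class_probs(x, weights)
            for c in CLASSES:
                for j, v in enumerate(x):
                    if (j, v, c) in weights:
                        expected[(j, v, c)] += probs[c]  # model expectation
        for k in weights:
            # GIS update: lambda += (1/C) * log(empirical / expected)
            weights[k] += math.log(empirical[k] / expected[k]) / c_const
    return weights
```

The returned dictionary can be split by its class component into the positive optimal weights (class +1) and the negative optimal weights (class -1).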
6. The method according to claim 5, characterized in that, where the positive account information comprises at least an ID, a nickname, a gender, an age, a location and the ratio of followers to followed users, and the negative account information comprises at least an ID, a nickname, a gender, an age, a location and the ratio of followers to followed users, the process of performing feature extraction on each positive-class sample and each negative-class sample respectively to obtain the corresponding positive training samples and negative training samples comprises:

Judging whether the IDs in the two positive account information items of each positive-class sample are identical; if they are identical, representing the first sub-value of the positive feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the nicknames in the two positive account information items of each positive-class sample are identical; if they are identical, representing the second sub-value of the positive feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the genders in the two positive account information items of each positive-class sample are identical; if they are identical, representing the third sub-value of the positive feature extraction result by the value 1, and if they differ, representing it by the value 0;

Comparing the ages in the two positive account information items of each positive-class sample; if the age is filled in in neither item, representing the fourth sub-value of the positive feature extraction result by the value 0; if the age is filled in in only one of the two items, representing it by the value 1; if the ages in the two items are identical, representing it by the value 2; and if the ages in the two items differ, representing it by the value 3;

Judging whether the locations in the two positive account information items of each positive-class sample are identical; if they are identical, representing the fifth sub-value of the positive feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the ratios of followers to followed users in the two positive account information items of each positive-class sample fall within the same ratio range; if so, representing the sixth sub-value of the positive feature extraction result by the value 1, and if not, representing it by the value 0;

Combining, for each positive-class sample, the first, second, third, fourth, fifth and sixth sub-values of its positive feature extraction result into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive-class sample;

Judging whether the IDs in the two negative account information items of each negative-class sample are identical; if they are identical, representing the first sub-value of the negative feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the nicknames in the two negative account information items of each negative-class sample are identical; if they are identical, representing the second sub-value of the negative feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the genders in the two negative account information items of each negative-class sample are identical; if they are identical, representing the third sub-value of the negative feature extraction result by the value 1, and if they differ, representing it by the value 0;

Comparing the ages in the two negative account information items of each negative-class sample; if the age is filled in in neither item, representing the fourth sub-value of the negative feature extraction result by the value 0; if the age is filled in in only one of the two items, representing it by the value 1; if the ages in the two items are identical, representing it by the value 2; and if the ages in the two items differ, representing it by the value 3;

Judging whether the locations in the two negative account information items of each negative-class sample are identical; if they are identical, representing the fifth sub-value of the negative feature extraction result by the value 1, and if they differ, representing it by the value 0;

Judging whether the ratios of followers to followed users in the two negative account information items of each negative-class sample fall within the same ratio range; if so, representing the sixth sub-value of the negative feature extraction result by the value 1, and if not, representing it by the value 0;

Combining, for each negative-class sample, the first, second, third, fourth, fifth and sixth sub-values of its negative feature extraction result into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative-class sample.
7. A microblog data processing apparatus, characterized by comprising:

A first feature extraction unit, configured to perform feature extraction on a sample to be tested to obtain a to-be-tested feature extraction result value, wherein the sample to be tested is a pair of information items consisting of first microblog account information and second microblog account information, and the microblog website to which the account corresponding to the first microblog account information belongs is different from the microblog website to which the account corresponding to the second microblog account information belongs;

A first determining unit, configured to determine each numerical value contained in the to-be-tested feature extraction result value as a feature value to be classified;

A first computing unit, configured to use a maximum entropy classifier to calculate, for each feature value to be classified, the corresponding to-be-classified positive predicted sub-conditional probability and to-be-classified negative predicted sub-conditional probability when the sample to be tested is preset as the positive class and as the negative class, respectively;

A second computing unit, configured to multiply together the to-be-classified positive predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a to-be-classified positive predicted conditional probability, and to multiply together the to-be-classified negative predicted sub-conditional probabilities corresponding to the feature values to be classified to obtain a to-be-classified negative predicted conditional probability;

A comparing unit, configured to compare the magnitudes of the to-be-classified positive predicted conditional probability and the to-be-classified negative predicted conditional probability, to trigger a second determining unit to determine that the class of the sample to be tested is positive when the comparison result is that the to-be-classified positive predicted conditional probability is the larger, and to trigger a third determining unit to determine that the class of the sample to be tested is negative when the comparison result is that the to-be-classified negative predicted conditional probability is the larger;

A fourth determining unit, configured to determine, when the class of the sample to be tested is positive, that the two accounts corresponding to the sample to be tested belong to the same user;

A fifth determining unit, configured to determine, when the class of the sample to be tested is negative, that the two accounts corresponding to the sample to be tested do not belong to the same user.
8. The apparatus according to claim 7, characterized in that the first computing unit comprises:

A computing sub-unit, configured to use the maximum entropy objective function formula $P_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left(\sum_i \lambda_i f_i(x, y)\right)$ to calculate, for each feature value to be classified, the corresponding to-be-classified positive predicted sub-conditional probability and to-be-classified negative predicted sub-conditional probability when y is +1 and -1, respectively, wherein y is the sample to be tested, x is the feature value to be classified, $P_\lambda(y \mid x)$ is the to-be-classified predicted sub-conditional probability, exp() is the exponential function with the natural constant e as its base, $f_i()$ is a binary feature function, $Z_\lambda(x) = \sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)$, $\lambda_i$ is the weight of the feature function value $f_i(x, y)$, namely the positive optimal weight when y is +1 or the negative optimal weight when y is -1, the different feature function values corresponding to the same x have the same weight, $\sum_i$ is the function that sums over the feature function values corresponding to each feature value to be classified, and $\sum_y$ is the function that sums the corresponding data over the different values of y;

Wherein y being +1 indicates that the sample to be tested is preset as the positive class, and y being -1 indicates that the sample to be tested is preset as the negative class; the feature function values corresponding to each feature value to be classified correspond respectively to the positive and negative preset classes of the sample to be tested; when calculating the to-be-classified positive predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the positive optimal weight corresponding to that feature value to be classified, and otherwise λ is 0; when calculating the to-be-classified negative predicted sub-conditional probability, if the feature value to be classified is contained in the preset feature values, λ is the negative optimal weight corresponding to that feature value to be classified, and otherwise λ is 0.
9. The apparatus according to claim 7 or 8, characterized in that, where the first microblog account information and the second microblog account information each comprise at least a user identification number (ID), a nickname, a gender, an age, a location and the ratio of followers to followed users, the first feature extraction unit comprises:

A first judging sub-unit, configured to judge whether the ID in the first microblog account information and the ID in the second microblog account information are identical; if they are identical, the first sub-value of the to-be-tested feature extraction result is represented by the value 1, and if they differ, it is represented by the value 0;

A second judging sub-unit, configured to judge whether the nickname in the first microblog account information and the nickname in the second microblog account information are identical; if they are identical, the second sub-value of the to-be-tested feature extraction result is represented by the value 1, and if they differ, it is represented by the value 0;

A third judging sub-unit, configured to judge whether the gender in the first microblog account information and the gender in the second microblog account information are identical; if they are identical, the third sub-value of the to-be-tested feature extraction result is represented by the value 1, and if they differ, it is represented by the value 0;

A first comparing sub-unit, configured to compare the age in the first microblog account information with the age in the second microblog account information; if the age is filled in in neither the first nor the second microblog account information, the fourth sub-value of the to-be-tested feature extraction result is represented by the value 0; if the age is filled in in only one of the first and second microblog account information, it is represented by the value 1; if the age in the first microblog account information is identical to the age in the second microblog account information, it is represented by the value 2; and if the two ages differ, it is represented by the value 3;

A fourth judging sub-unit, configured to judge whether the location in the first microblog account information and the location in the second microblog account information are identical; if they are identical, the fifth sub-value of the to-be-tested feature extraction result is represented by the value 1, and if they differ, it is represented by the value 0;

A fifth judging sub-unit, configured to judge whether the ratio of followers to followed users in the first microblog account information and the ratio of followers to followed users in the second microblog account information fall within the same ratio range; if so, the sixth sub-value of the to-be-tested feature extraction result is represented by the value 1, and if not, it is represented by the value 0;

A first composing sub-unit, configured to combine the first sub-value, the second sub-value, the third sub-value, the fourth sub-value, the fifth sub-value and the sixth sub-value of the to-be-tested feature extraction result into the to-be-tested feature extraction result value.
10. A microblog data processing system, characterized by comprising a maximum entropy classifier training apparatus and the microblog data processing apparatus according to any one of claims 7 to 9, wherein the maximum entropy classifier training apparatus comprises:

An acquiring unit, configured to obtain a plurality of different positive-class samples and a plurality of different negative-class samples, wherein each positive-class sample comprises two positive account information items, which are respectively the account information of the same user on two different microblog websites; each negative-class sample comprises two negative account information items, which belong to different users and whose corresponding accounts belong to different microblog websites; the two microblog websites corresponding to the positive-class samples are the same as the two microblog websites corresponding to the negative-class samples; the two microblog websites corresponding to the sample to be tested are the same as the two microblog websites corresponding to the positive-class samples; the positive account information comprises at least an ID, a nickname, a gender, an age, a location and the ratio of followers to followed users; and the negative account information comprises at least an ID, a nickname, a gender, an age, a location and the ratio of followers to followed users;

A second feature extraction unit, configured to perform feature extraction on each positive class sample and each negative class sample respectively, to obtain the corresponding positive training samples and negative training samples;

A sixth determining unit, configured to determine that the numerical values contained in each positive training sample and each negative training sample are feature values;

A third computing unit, configured to calculate, according to the formula $P_\lambda(y \mid x) = \frac{\exp\left(\sum_i \lambda_i f_i(x, y)\right)}{\sum_y \exp\left(\sum_i \lambda_i f_i(x, y)\right)}$, for each feature value, the corresponding positive predicted conditional probability and negative predicted conditional probability when y is +1 and -1 respectively, wherein y is any positive training sample or any negative training sample, x is a feature value, P_λ(y | x) is the predicted conditional probability, exp() is the exponential function with the natural number e as its base, f_i() is a binary feature function, λ_i is the weight of the feature function value f_i(x, y), different feature function values corresponding to the same x have the same weight, $\sum_i$ is the function that sums the feature function values corresponding to each feature value, $\sum_y$ is the function that sums the corresponding data over the different values of y, and the initial value of λ_i is known;

A fourth computing unit, configured to use the GIS (generalized iterative scaling) algorithm to adjust the positive predicted conditional probability corresponding to each feature value until the positive predicted conditional probability of each feature value converges, and to take the λ corresponding to the converged positive predicted conditional probability of each feature value as the positive optimal weight of the feature function value corresponding to that feature value;

A fifth computing unit, configured to use the GIS algorithm to adjust the negative predicted conditional probability corresponding to each feature value until the negative predicted conditional probability of each feature value converges, and to take the λ corresponding to the converged negative predicted conditional probability of each feature value as the negative optimal weight of the feature function value corresponding to that feature value.
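The conditional probability and the GIS weight adjustment recited in claim 10 can be illustrated with a textbook sketch. The following is not the patented implementation: it assumes one binary feature function per (position, value, label) triple, a known initial weight of 0, a fixed iteration cap and tolerance, and it omits the usual GIS correction feature for brevity. The names `active_features`, `predict_proba` and `train_gis` are hypothetical.

```python
import math
from collections import defaultdict

LABELS = (+1, -1)


def active_features(x, y):
    """Binary feature functions f_i(x, y) that equal 1 for this (x, y).

    Assumption: one indicator per (position, value, label) triple.
    """
    return [(j, v, y) for j, v in enumerate(x)]


def predict_proba(weights, x):
    """P(y | x) = exp(sum_i l_i f_i(x, y)) / sum_y' exp(sum_i l_i f_i(x, y'))."""
    scores = {
        y: math.exp(sum(weights.get(f, 0.0) for f in active_features(x, y)))
        for y in LABELS
    }
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}


def train_gis(samples, iterations=200, tol=1e-6):
    """Textbook GIS. samples is a list of (x, y), x a tuple, y in {+1, -1}."""
    c = len(samples[0][0])  # upper bound on active features per (x, y)
    n = len(samples)
    # Empirical expectation of each feature observed in the training data.
    empirical = defaultdict(float)
    for x, y in samples:
        for f in active_features(x, y):
            empirical[f] += 1.0 / n
    weights = {f: 0.0 for f in empirical}  # assumed known initial value
    for _ in range(iterations):
        # Expectation of each feature under the current model.
        expected = defaultdict(float)
        for x, _label in samples:
            p = predict_proba(weights, x)
            for y in LABELS:
                for f in active_features(x, y):
                    expected[f] += p[y] / n
        max_step = 0.0
        for f in weights:
            step = math.log(empirical[f] / expected[f]) / c
            weights[f] += step
            max_step = max(max_step, abs(step))
        if max_step < tol:  # treat a negligible update as convergence
            break
    return weights
```

In this sketch the weights keyed by label +1 play the role of the positive optimal weights and those keyed by label -1 the role of the negative optimal weights; that mapping is an interpretation, not something the claim spells out.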
11. The system according to claim 10, characterized in that, where the positive account information comprises at least: ID, nickname, gender, age, location, and the ratio of followers to followed users, and the negative account information comprises at least: ID, nickname, gender, age, location, and the ratio of followers to followed users, the second feature extraction unit comprises:

A sixth judgment sub-unit, configured to judge whether the IDs in the two pieces of positive account information of each positive class sample are identical; if identical, the first sub-value of the positive feature extraction result is represented by the value 1; if not identical, the first sub-value of the positive feature extraction result is represented by the value 0;

A seventh judgment sub-unit, configured to judge whether the nicknames in the two pieces of positive account information of each positive class sample are identical; if identical, the second sub-value of the positive feature extraction result is represented by the value 1; if not identical, the second sub-value of the positive feature extraction result is represented by the value 0;

An eighth judgment sub-unit, configured to judge whether the genders in the two pieces of positive account information of each positive class sample are identical; if identical, the third sub-value of the positive feature extraction result is represented by the value 1; if not identical, the third sub-value of the positive feature extraction result is represented by the value 0;

A second comparing sub-unit, configured to compare the ages in the two pieces of positive account information of each positive class sample; if the age is not filled in for either piece of positive account information, the fourth sub-value of the positive feature extraction result is represented by the value 0; if the age is filled in for only one of the two pieces of positive account information, the fourth sub-value of the positive feature extraction result is represented by the value 1; if the ages in the two pieces of positive account information are identical, the fourth sub-value of the positive feature extraction result is represented by the value 2; if the ages in the two pieces of positive account information are not identical, the fourth sub-value of the positive feature extraction result is represented by the value 3;

A ninth judgment sub-unit, configured to judge whether the locations in the two pieces of positive account information of each positive class sample are identical; if identical, the fifth sub-value of the positive feature extraction result is represented by the value 1; if not identical, the fifth sub-value of the positive feature extraction result is represented by the value 0;

A tenth judgment sub-unit, configured to judge whether the ratios of followers to followed users in the two pieces of positive account information of each positive class sample fall within the same ratio range; if so, the sixth sub-value of the positive feature extraction result is represented by the value 1; if not, the sixth sub-value of the positive feature extraction result is represented by the value 0;

A second composition sub-unit, configured to combine, for each positive class sample, the corresponding first sub-value, second sub-value, third sub-value, fourth sub-value, fifth sub-value and sixth sub-value of the positive feature extraction result into a positive feature extraction result value, which serves as the positive training sample corresponding to that positive class sample;

An eleventh judgment sub-unit, configured to judge whether the IDs in the two pieces of negative account information of each negative class sample are identical; if identical, the first sub-value of the negative feature extraction result is represented by the value 1; if not identical, the first sub-value of the negative feature extraction result is represented by the value 0;

A twelfth judgment sub-unit, configured to judge whether the nicknames in the two pieces of negative account information of each negative class sample are identical; if identical, the second sub-value of the negative feature extraction result is represented by the value 1; if not identical, the second sub-value of the negative feature extraction result is represented by the value 0;

A thirteenth judgment sub-unit, configured to judge whether the genders in the two pieces of negative account information of each negative class sample are identical; if identical, the third sub-value of the negative feature extraction result is represented by the value 1; if not identical, the third sub-value of the negative feature extraction result is represented by the value 0;

A third comparing sub-unit, configured to compare the ages in the two pieces of negative account information of each negative class sample; if the age is not filled in for either piece of negative account information, the fourth sub-value of the negative feature extraction result is represented by the value 0; if the age is filled in for only one of the two pieces of negative account information, the fourth sub-value of the negative feature extraction result is represented by the value 1; if the ages in the two pieces of negative account information are identical, the fourth sub-value of the negative feature extraction result is represented by the value 2; if the ages in the two pieces of negative account information are not identical, the fourth sub-value of the negative feature extraction result is represented by the value 3;

A fourteenth judgment sub-unit, configured to judge whether the locations in the two pieces of negative account information of each negative class sample are identical; if identical, the fifth sub-value of the negative feature extraction result is represented by the value 1; if not identical, the fifth sub-value of the negative feature extraction result is represented by the value 0;

A fifteenth judgment sub-unit, configured to judge whether the ratios of followers to followed users in the two pieces of negative account information of each negative class sample fall within the same ratio range; if so, the sixth sub-value of the negative feature extraction result is represented by the value 1; if not, the sixth sub-value of the negative feature extraction result is represented by the value 0;

A third composition sub-unit, configured to combine, for each negative class sample, the corresponding first sub-value, second sub-value, third sub-value, fourth sub-value, fifth sub-value and sixth sub-value of the negative feature extraction result into a negative feature extraction result value, which serves as the negative training sample corresponding to that negative class sample.
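The six sub-values recited in claim 11 can be combined into one training vector as in the sketch below. It is an illustration under stated assumptions: the dictionary keys (`user_id`, `nickname`, `gender`, `age`, `location`), the use of `None` for an age that was not filled in, and the caller-supplied `same_ratio_range` predicate are all hypothetical, since the claim does not prescribe a data layout or define the ratio ranges.

```python
def age_sub_value(age_a, age_b):
    """Fourth sub-value: 0 neither filled in, 1 only one filled in,
    2 both filled in and equal, 3 both filled in and different."""
    if age_a is None and age_b is None:
        return 0
    if age_a is None or age_b is None:
        return 1
    return 2 if age_a == age_b else 3


def extract_features(a: dict, b: dict, same_ratio_range) -> tuple:
    """Build the six-value feature extraction result for an account pair.

    same_ratio_range is a hypothetical callable (a, b) -> bool that decides
    whether the two follower-to-followed ratios fall within the same range.
    """
    return (
        int(a["user_id"] == b["user_id"]),          # first sub-value
        int(a["nickname"] == b["nickname"]),        # second sub-value
        int(a["gender"] == b["gender"]),            # third sub-value
        age_sub_value(a.get("age"), b.get("age")),  # fourth sub-value
        int(a["location"] == b["location"]),        # fifth sub-value
        int(bool(same_ratio_range(a, b))),          # sixth sub-value
    )
```

The same function serves positive and negative class samples alike; only the label attached to the resulting vector differs.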