CN106021235B

CN106021235B - Data mining processing method and device

Info

Publication number: CN106021235B
Application number: CN201610387322.5A
Authority: CN
Inventors: 黄引刚
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2016-06-01
Filing date: 2016-06-01
Publication date: 2019-01-29
Anticipated expiration: 2036-06-01
Also published as: CN106021235A

Abstract

Embodiments of the invention disclose a data mining processing method and device. The method includes: obtaining multiple pieces of user remark information corresponding to a user to be mined, extracting at least one piece of candidate user remark information from them, and taking each group of identical candidate user remark information as a candidate real name; counting, according to the pinyin of each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name from the occurrence count of each identical pinyin and the occurrence count of each candidate real name; and taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined. The invention can accurately analyze and identify a user's real name and thereby enrich the functions of a social network.

Description

Data mining processing method and device

Technical field

The present invention relates to the field of Internet technology, and in particular to a data mining processing method and device.

Background art

With the development of Internet technology, more and more users take part in social networks. Before joining a social network, a user must first register, and the registered user name can be any characters the user chooses, so the registration information may not contain the user's real name. Security monitoring in a social network, however, needs the user's real name to determine whether the user is fraudulent; likewise, precise audience mining in a social network also needs the user's real name. In current social networks, the real name can only be obtained when the user volunteers it. When the user is unwilling to provide it, the server side of the social network cannot learn the user's real name, and some functions of the social network cannot be fully realized.

Summary of the invention

Embodiments of the present invention provide a data mining processing method and device that can accurately analyze and identify a user's real name and thereby enrich the functions of a social network.

An embodiment of the present invention provides a data mining processing method, comprising:

obtaining multiple pieces of user remark information corresponding to a user to be mined, extracting at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.

Correspondingly, an embodiment of the present invention further provides a data mining processing device, comprising:

an acquisition and mining module, configured to obtain multiple pieces of user remark information corresponding to a user to be mined, extract at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

a calculation module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

a determination module, configured to take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.

In the embodiments of the present invention, at least one piece of candidate user remark information is extracted from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is counted according to the pinyin of each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the occurrence count of each identical pinyin and the occurrence count of each candidate real name. Finally, the candidate real name with the largest posterior probability is taken as the optimal real name of the user to be mined. In this way, the user's real name can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a data mining processing method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of yet another data mining processing method provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of a further data mining processing method provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a data mining processing device provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an acquisition and mining module provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a calculation module provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a first probability calculation unit provided by an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a determination module provided by an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of another data mining processing device provided by an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of a server provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which is a schematic flowchart of a data mining processing method provided by an embodiment of the present invention, the method may include:

S101: obtain multiple pieces of user remark information corresponding to a user to be mined, extract at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

Specifically, a server of the social network can obtain multiple pieces of user remark information corresponding to the user to be mined. The user to be mined is the user whose real name the server needs to identify, and the multiple pieces of user remark information are the remarks that the user's friends have set for that user. For example, if the user to be mined has 100 friends and 75 of them have set a remark for the user, the remarks set by those 75 friends serve as the multiple pieces of user remark information. The server then extracts at least one piece of candidate user remark information from the multiple pieces of user remark information and takes each group of identical candidate user remark information as a candidate real name. For example, if 20 pieces of candidate user remark information are "Wang AB", 3 are "Huang AC", 15 are "Huang AB" and 30 are "Wang AC", then "Wang AB", "Huang AC", "Huang AB" and "Wang AC" are all taken as candidate real names.

The server can extract the at least one piece of candidate user remark information as follows: obtain the multiple pieces of user remark information corresponding to the user to be mined; according to a name structure rule and a preset surname matching table, filter out of the multiple pieces of user remark information the first-type user remark information that satisfies the surname condition; delete from the first-type user remark information every entry that contains a proper noun and/or a high-frequency word; and take the first-type user remark information remaining after the deletion as the at least one piece of candidate user remark information. Here the proper nouns may include role-specific titles such as "teacher", "master", "sir" and "Miss", and the high-frequency words may include frequently occurring everyday words such as "tomorrow", "the day after tomorrow", "eat" and "drink water". For example, the first-type user remark information "Teacher Wang" contains a proper noun and can therefore be deleted.

The name structure rule may refer to the number of characters in an ordinary name: an ordinary name generally has 2 to 4 Chinese characters (2 to 3 characters for a name with a single-character surname, and 3 to 4 characters for a name with a compound surname). Accordingly, the first-type user remark information that satisfies the surname condition can be filtered out of the multiple pieces of user remark information, according to the name structure rule and the preset surname matching table, as follows. The server first segments the obtained pieces of user remark information with a word segmentation algorithm (for example, the user remark information "he is Wang AB" becomes "Wang AB" after segmentation), and then selects the segmented user remark information that contains 2 to 4 Chinese characters, which yields the preliminarily screened user remark information.

For preliminarily screened user remark information of 2 characters, the server checks whether the first character exists in the single-character surname set of the preset surname matching table. If it does, the entry satisfies the surname condition and is taken as first-type user remark information; otherwise it is discarded.

For preliminarily screened user remark information of 4 characters, the server checks whether the first two characters exist in the compound surname set of the preset surname matching table. If they do, the entry satisfies the surname condition and is taken as first-type user remark information; otherwise it is discarded.

For preliminarily screened user remark information of 3 characters, the server checks whether the first character exists in the single-character surname set or the first two characters exist in the compound surname set. Satisfying either condition is enough for the entry to satisfy the surname condition and be taken as first-type user remark information; the entry is discarded only if neither condition is satisfied.
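The screening in S101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the remarks have already been segmented, and the tables `SINGLE_SURNAMES`, `COMPOUND_SURNAMES`, `ROLE_WORDS` and `HIGH_FREQ_WORDS`, as well as the function names, are hypothetical and heavily abbreviated stand-ins for the preset surname matching table and the excluded-word lists.

```python
import re

# Hypothetical, abbreviated lookup tables (assumptions for illustration only).
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}
COMPOUND_SURNAMES = {"欧阳", "司马"}
ROLE_WORDS = {"老师", "师傅", "先生", "小姐"}       # role-specific titles
HIGH_FREQ_WORDS = {"明天", "后天", "吃饭", "喝水"}   # everyday words

# Name structure rule: 2 to 4 Chinese characters and nothing else.
HAN_ONLY = re.compile(r"[\u4e00-\u9fff]{2,4}")


def matches_surname(remark: str) -> bool:
    """Apply the length-dependent surname checks to a segmented remark."""
    if not HAN_ONLY.fullmatch(remark):
        return False
    n = len(remark)
    if n == 2:
        return remark[0] in SINGLE_SURNAMES
    if n == 4:
        return remark[:2] in COMPOUND_SURNAMES
    # n == 3: a single-character or a compound surname is enough.
    return remark[0] in SINGLE_SURNAMES or remark[:2] in COMPOUND_SURNAMES


def select_candidates(remarks):
    """Keep remarks that pass the surname check and contain no excluded word."""
    excluded = ROLE_WORDS | HIGH_FREQ_WORDS
    return [
        r for r in remarks
        if matches_surname(r) and not any(word in r for word in excluded)
    ]


remarks = ["王小明", "王老师", "欧阳小明", "明天见", "黄蓉", "小王"]
candidates = select_candidates(remarks)
```

In this sketch "王老师" passes the surname check but is dropped because it contains a role word, while "明天见" and "小王" fail the surname check outright.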

S102: count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

Specifically, the server can obtain the full pinyin of each piece of candidate user remark information among the at least one piece of candidate user remark information; the full pinyin consists of a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", its full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo". The server then counts, over all candidate user remark information, the occurrence count of each identical surname pinyin and the occurrence count of each identical given-name pinyin. For example, suppose the at least one piece of candidate user remark information includes 20 "Zhang Xiaobo", 25 "Zhang Xiaobo" (written with different characters but the same pinyin), 10 "Wang Xiaobo" and 5 "Zhang Haibo". The identical surname pinyins are then "zhang" and "wang", and the identical given-name pinyins are "xiaobo" and "haibo", so the occurrence count of the surname pinyin "zhang" is 50, that of the surname pinyin "wang" is 10, that of the given-name pinyin "xiaobo" is 55, and that of the given-name pinyin "haibo" is 5. The server then calculates the joint probability of each identical full pinyin from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin and the total amount of candidate user remark information, and calculates the posterior probability of each candidate real name from the occurrence count of the identical full pinyin with the largest joint probability and the occurrence count of each candidate real name.
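The counting step in this example can be reproduced with a short sketch. The tuple layout and variable names are illustrative assumptions, and the third entry is taken to have the given-name pinyin "xiaobo", as the stated totals (55 for "xiaobo", 5 for "haibo") require.

```python
from collections import Counter

# (surname pinyin, given-name pinyin, occurrence count) per distinct remark.
remarks = [
    ("zhang", "xiaobo", 20),
    ("zhang", "xiaobo", 25),  # different characters, same pinyin
    ("wang", "xiaobo", 10),
    ("zhang", "haibo", 5),
]

surname_counts = Counter()
given_counts = Counter()
for surname, given, count in remarks:
    surname_counts[surname] += count
    given_counts[given] += count
```

Converting Chinese characters to pinyin is left out here; in practice a pinyin conversion library or lookup table would supply the two pinyin fields.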

The joint probability of each identical full pinyin is calculated from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin and the total amount of candidate user remark information as follows: calculate the first probability of each identical surname pinyin from its occurrence count and the total amount of candidate user remark information; calculate the second probability of each identical given-name pinyin from its occurrence count and the total amount of candidate user remark information; and combine each first probability with each second probability to obtain the joint probability of each identical full pinyin.

The joint probability is calculated as P_{full pinyin} = P_{surname pinyin} × P_{given-name pinyin}, where P_{surname pinyin} is the first probability and P_{given-name pinyin} is the second probability. The posterior probability is calculated as P(candidate real name | best full pinyin) = (occurrence count of the candidate real name within the best full pinyin) / (occurrence count of the best full pinyin), where the best full pinyin is the identical full pinyin with the largest joint probability. If the full pinyin of a candidate real name is not the best full pinyin, the occurrence count of that candidate real name within the best full pinyin is 0.

For example, suppose the at least one piece of candidate user remark information includes three differently written names that share the full pinyin "wu xiaobo", referred to here as "Wu Xiaobo (form 1)" with 30 occurrences, "Wu Xiaobo (form 2)" with 20 occurrences and "Wu Xiaobo (form 3)" with 10 occurrences, together with 10 "Zhang Xiaobo" and 30 "Zhang Haibo". The identical full pinyins are "wu xiaobo", "zhang xiaobo" and "zhang haibo", and the total amount of candidate user remark information is 100. Then:

P_{surname pinyin}("wu") = occurrence count of "wu" / total = 60/100;
P_{surname pinyin}("zhang") = occurrence count of "zhang" / total = 40/100;
P_{given-name pinyin}("xiaobo") = occurrence count of "xiaobo" / total = 70/100;
P_{given-name pinyin}("haibo") = occurrence count of "haibo" / total = 30/100.

The joint probabilities are therefore:

P_{full pinyin}("wu xiaobo") = P_{surname pinyin}("wu") × P_{given-name pinyin}("xiaobo") = 42/100;
P_{full pinyin}("zhang xiaobo") = P_{surname pinyin}("zhang") × P_{given-name pinyin}("xiaobo") = 28/100;
P_{full pinyin}("zhang haibo") = P_{surname pinyin}("zhang") × P_{given-name pinyin}("haibo") = 12/100.

The joint probability of "wu xiaobo" is the largest, so "wu xiaobo" is the best full pinyin. The posterior probabilities follow:

P(Wu Xiaobo (form 1) | best full pinyin "wu xiaobo") = 30/60;
P(Wu Xiaobo (form 2) | best full pinyin "wu xiaobo") = 20/60;
P(Wu Xiaobo (form 3) | best full pinyin "wu xiaobo") = 10/60;
P(Zhang Xiaobo | best full pinyin "wu xiaobo") = 0;
P(Zhang Haibo | best full pinyin "wu xiaobo") = 0.
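The worked example above can be checked with a small sketch. It is only an illustration of the two formulas under stated assumptions: the three homophonous written forms of "wu xiaobo" are labelled form 1 to form 3 (hypothetical labels), pinyin is supplied directly instead of being derived from characters, and exact fractions are used so the results match the text.

```python
from collections import Counter
from fractions import Fraction

# (written form, surname pinyin, given-name pinyin, occurrence count)
candidates = [
    ("Wu Xiaobo (form 1)", "wu", "xiaobo", 30),
    ("Wu Xiaobo (form 2)", "wu", "xiaobo", 20),
    ("Wu Xiaobo (form 3)", "wu", "xiaobo", 10),
    ("Zhang Xiaobo", "zhang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 30),
]

total = sum(count for *_, count in candidates)
surname_counts, given_counts, full_counts = Counter(), Counter(), Counter()
for _, surname, given, count in candidates:
    surname_counts[surname] += count
    given_counts[given] += count
    full_counts[(surname, given)] += count

# Joint probability of each full pinyin: P(surname pinyin) * P(given-name pinyin).
joint = {
    (s, g): Fraction(surname_counts[s], total) * Fraction(given_counts[g], total)
    for (s, g) in full_counts
}
best = max(joint, key=joint.get)  # best full pinyin

# Posterior of each candidate real name given the best full pinyin;
# candidates with a different full pinyin count as 0 occurrences.
posterior = {
    name: Fraction(count if (s, g) == best else 0, full_counts[best])
    for name, s, g, count in candidates
}
optimal = max(posterior, key=posterior.get)
```

Note that the joint probability treats the surname pinyin and the given-name pinyin as independent, which is why the three joint values need not sum to 1.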

S103: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, after the server has calculated the posterior probability of each candidate real name, it can take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined, that is, treat the optimal real name as the actual real name of the user to be mined, so that the user's real name is accurately identified. For example, suppose the candidate real names are three differently written forms of "Wu Xiaobo" (form 1, form 2 and form 3), "Zhang Xiaobo" and "Zhang Haibo", with posterior probabilities of 30/60 for form 1, 20/60 for form 2, 10/60 for form 3, 0 for "Zhang Xiaobo" and 0 for "Zhang Haibo". The form with the largest posterior probability, "Wu Xiaobo" (form 1), is then determined to be the optimal real name of the user to be mined.

Referring to Fig. 2, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention, the method may include:

S201: obtain multiple pieces of user remark information corresponding to a user to be mined, extract at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

S202: count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

For the specific implementation of steps S201 to S202, refer to S101 to S102 in the embodiment corresponding to Fig. 1; the details are not repeated here.

S203: judge whether the largest posterior probability is greater than a preset probability threshold;

Specifically, after the server has calculated the posterior probability of each candidate real name, it can further judge whether the largest posterior probability is greater than the preset probability threshold.

S204: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, if the result of S203 is yes, the largest posterior probability is sufficiently reliable, so the candidate real name with the largest posterior probability can be taken as the optimal real name of the user to be mined, which helps ensure that the optimal real name is the user's actual real name.

S205: correct the posterior probability of each candidate real name according to a preset weight adjustment rule, and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined;

Specifically, if the result of S203 is no, the server can correct the posterior probability of each candidate real name according to the preset weight adjustment rule and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined. The weight adjustment rule includes at least one of the following mapping relations: the mapping between the occurrence count of a candidate real name and a correction parameter; the mapping between the weight of an identical full pinyin and a correction parameter; the mapping between the character complexity of a candidate real name and a correction parameter; the mapping between the character length of a candidate real name and a correction parameter; and the mapping between the popularity of a surname and a correction parameter.

The mapping between the occurrence count of a candidate real name and a correction parameter maps multiple occurrence count ranges to multiple correction parameters. A larger occurrence count range corresponds to a larger correction parameter, and the correction parameter for an occurrence count range below a count threshold is negative. For example, if candidate real name A occurs more often than candidate real name B, the correction parameter of A is larger, that is, the posterior probability of A is increased by a larger amount; and if the occurrence count of candidate real name C is below the count threshold, the posterior probability of C is reduced.

The mapping between the weight of an identical full pinyin and a correction parameter maps multiple weight ranges to multiple correction parameters. A larger weight range corresponds to a larger correction parameter, and the correction parameter for a weight range below a weight threshold can be negative. For example, the larger the share of the total user remark information that a given identical full pinyin accounts for, the greater the weight of that full pinyin and the larger its correction parameter, which raises the posterior probabilities of the candidate real names belonging to that full pinyin.

The mapping between the character complexity of a candidate real name and a correction parameter maps multiple character complexities to multiple correction parameters, with greater character complexity corresponding to a larger correction parameter. If a candidate real name contains Chinese characters that are hard to write and uncommon (that is, it has high character complexity), it can correspond to a large correction parameter, so its posterior probability can be raised substantially.

The mapping between the character length of a candidate real name and a correction parameter maps multiple character lengths to multiple correction parameters, with a longer character length corresponding to a larger correction parameter. For example, if the character length of candidate real name A is greater than that of candidate real name B, A can correspond to a larger correction parameter, so its posterior probability can be raised by a larger amount.

The mapping between the popularity of a surname and a correction parameter maps multiple surname popularity levels to multiple correction parameters. A more common surname corresponds to a larger correction parameter, and the correction parameter for a surname whose popularity is below a popularity threshold can be negative. For example, the correction parameter for the surname "Wang" is larger than that for the surname "Ouyang".

The server can therefore correct the posterior probability of each candidate real name according to one of the mapping relations in the weight adjustment rule or a combination of several of them (a correction may either increase or decrease a posterior probability), and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined.
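The branch between S204 and S205 can be sketched as below. The text does not give concrete ranges, correction values or a way of combining several mappings, so everything numeric here is an assumption: the thresholds, the correction amounts, the additive combination, the `complexity` score (imagined as coming from some stroke-count or rarity lookup), and the names `Candidate`, `RULES` and `pick_real_name` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    count: int        # occurrences of this written form
    posterior: float  # posterior probability from S202
    complexity: int   # hypothetical complexity level, 1 (common) to 3 (rare)


def frequency_correction(c: Candidate) -> float:
    # Assumed ranges: larger count range -> larger correction,
    # negative below the count threshold (5 in this sketch).
    if c.count < 5:
        return -0.05
    return 0.02 if c.count < 20 else 0.05


def complexity_correction(c: Candidate) -> float:
    # Assumed values: rarer characters receive a larger correction.
    return {1: 0.0, 2: 0.03, 3: 0.06}.get(c.complexity, 0.0)


RULES = [frequency_correction, complexity_correction]


def pick_real_name(candidates, prob_threshold=0.6):
    best = max(candidates, key=lambda c: c.posterior)
    if best.posterior > prob_threshold:   # S203 yes -> S204
        return best.name
    corrected = {                          # S203 no -> S205
        c.name: c.posterior + sum(rule(c) for rule in RULES)
        for c in candidates
    }
    return max(corrected, key=corrected.get)


pool = [
    Candidate("X", count=12, posterior=12 / 30, complexity=1),
    Candidate("Y", count=11, posterior=11 / 30, complexity=3),
    Candidate("Z", count=7, posterior=7 / 30, complexity=1),
]
chosen = pick_real_name(pool)
```

With the default threshold no posterior is high enough, so the corrections apply and the rarer written form "Y" overtakes "X"; with a lower threshold such as 0.3 the uncorrected leader "X" is returned directly.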

In this embodiment of the present invention, at least one piece of candidate user remark information is extracted from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is counted according to the pinyin of each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the occurrence count of each identical pinyin and the occurrence count of each candidate real name. When the largest posterior probability is greater than the preset probability threshold, the candidate real name with the largest posterior probability is taken as the optimal real name of the user to be mined, so that the user's real name can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name. When the largest posterior probability is less than or equal to the preset probability threshold, the posterior probability of each candidate real name can further be corrected according to the preset weight adjustment rule, and the candidate real name with the largest corrected posterior probability is taken as the optimal real name of the user to be mined, which further improves the accuracy of real name identification.

Fig. 3 is referred to, is the flow diagram of another data mining processing method provided in an embodiment of the present invention, it is described Method may include:

S301 obtains multiple user's remark informations corresponding with user to be excavated, and in the multiple user's remark information Middle mining analysis goes out at least one candidate user remark information, and will be identical at least one described candidate user remark information Candidate user remark information is respectively as candidate real name；

It is corresponding out to count each identical phonetic according to the corresponding phonetic of each candidate user remark information by S302 The existing frequency, and according to each identical corresponding frequency of occurrence of phonetic and the corresponding appearance frequency of each candidate's real name It is secondary, calculate the corresponding posterior probability of each candidate's real name；

Wherein, the specific implementation of S301 to S302 step may refer to the S101 in above-mentioned Fig. 1 corresponding embodiment extremely S102 is not discussed here.

S303, judges whether maximum posterior probability is greater than predetermined probabilities threshold value；

S304, using the candidate real name of the maximum posterior probability as the optimal real name of the user to be excavated；

Specifically, if the determination in S303 is yes, the maximum posterior probability is sufficiently credible, so the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, so as to ensure that the optimal real name is the actual real name of the user to be mined.

S305, calculate the ranking weight value of each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined;

Specifically, if the determination in S303 is no, the server can obtain the remark attribute of the user corresponding to each piece of candidate user remark information (that is, the user who wrote the remark for the user to be mined). A user's remark attribute includes the number of pieces of user remark information in which that user remarked a friend with the friend's real name, and the total number of pieces of user remark information that user has written for friends. The server then calculates, from the remark attribute, the user remark real-name habit value corresponding to each piece of candidate user remark information, where the user remark real-name habit value is the ratio of the number of remarks in which the user used a friend's real name to the total number of remarks the user has written for friends. For example, suppose the user corresponding to a piece of candidate user remark information is user A. If user A has written 100 pieces of user remark information for other people, and 70 of those 100 are actual real names, then the user remark real-name habit value of user A is 70/100. After calculating the user remark real-name habit value corresponding to each piece of candidate user remark information, the server can calculate the ranking weight value of each candidate real name according to those habit values and the posterior probability of each candidate real name, and take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined.

The specific process of calculating the ranking weight value of each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name may be as follows. Taking one candidate real name A as an example, the server can determine the multiple pieces of candidate user remark information corresponding to candidate real name A (that is, those whose content is candidate real name A) as multiple pieces of target candidate user remark information, and then calculate the average of the user remark real-name habit values corresponding to those pieces of target candidate user remark information. The ranking weight value is then obtained either by adding the average to the posterior probability of candidate real name A, or by adding a certain coefficient to the average and multiplying the result by the posterior probability of candidate real name A. The ranking weight values of the other candidate real names are calculated on the same principle.
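The habit value and the two ways of combining it with the posterior probability can be sketched as follows. This is a minimal illustration only: the function names, the second remarker's 50/100 habit value, the posterior of 1/2 and the coefficient of 1/10 are assumptions chosen for the example and do not come from the embodiment; only the 70/100 habit value of user A does.

```python
from fractions import Fraction
from statistics import mean


def habit_value(real_name_remarks: int, total_remarks: int) -> Fraction:
    """Ratio of a remarker's real-name remarks to all remarks they wrote."""
    if total_remarks == 0:
        return Fraction(0)  # assumption: a remarker with no remarks counts as 0
    return Fraction(real_name_remarks, total_remarks)


def ranking_weight(posterior, habit_values, coefficient=None):
    """Ranking weight of one candidate real name.

    coefficient is None -> additive variant: average habit value + posterior
    otherwise           -> (average habit value + coefficient) * posterior
    """
    average = mean(habit_values)
    if coefficient is None:
        return average + posterior
    return (average + coefficient) * posterior


# User A wrote 100 remarks, 70 of them real names (from the text above);
# the second remarker and all other numbers are hypothetical.
habits = [habit_value(70, 100), habit_value(50, 100)]
additive = ranking_weight(Fraction(1, 2), habits)
multiplicative = ranking_weight(Fraction(1, 2), habits, Fraction(1, 10))
```

Exact fractions are used so that the two variants can be compared without rounding; an implementation could equally use floating-point values.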

Optionally, if the maximum corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the ranking weight values corresponding to the corrected posterior probabilities can be calculated on the principle of S305, so as to determine the optimal real name more accurately.

Optionally, if the maximum ranking weight value calculated in S305 is still less than the preset probability threshold, the ranking weight values can be corrected on the principle of S205 in the embodiment corresponding to Fig. 2 above, so as to determine the optimal real name more accurately.

In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The frequency of occurrence of each identical pinyin is counted according to the pinyin of each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name. When the maximum posterior probability is greater than the preset probability threshold, the candidate real name with the maximum posterior probability is taken as the optimal real name of the user to be mined, so that the user's real name can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the ranking weight value of each candidate real name can additionally be calculated according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and the candidate real name with the maximum ranking weight value is taken as the optimal real name of the user to be mined, which further improves the accuracy of real-name identification.

Referring to Fig. 4, which is a schematic flowchart of yet another data mining processing method provided in an embodiment of the present invention, the method may include:

S401, obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

S402, count the frequency of occurrence of each identical pinyin according to the pinyin of each piece of candidate user remark information, and calculate the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;

For the specific implementation of steps S401 to S402, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.

S403, determine whether the maximum posterior probability is greater than the preset probability threshold;

S404, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

Specifically, if the determination in S403 is yes, the maximum posterior probability is sufficiently credible, so the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, so as to ensure that the optimal real name is the actual real name of the user to be mined.

S405, select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined;

Specifically, if the determination in S403 is no, the server can select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the two candidate real names according to the extracted features and the preset rank model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined. The rank model may be a pairwise rank model. The features may include: the total character length of the user remark information before word segmentation, the character length before the name, the character length after the name, the total character length of the candidate user remark information, the user remark real-name habit value of the user to be mined, and the user remark real-name habit value of the user corresponding to the candidate user remark information (the user who wrote the remark for the user to be mined).
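The six features listed above can be assembled into a vector roughly as follows. This is a sketch under assumptions: the function name and argument layout are hypothetical, and the "character length before/after the name" is interpreted here as the number of characters preceding or following the name inside the raw remark, which the embodiment does not spell out.

```python
def extract_features(raw_remark: str, name: str,
                     mined_user_habit: float, remarker_habit: float) -> list:
    """Build the feature vector for one piece of candidate user remark information.

    raw_remark       -- the remark before word segmentation
    name             -- the candidate user remark information (the name itself)
    mined_user_habit -- habit value of the user to be mined
    remarker_habit   -- habit value of the user who wrote the remark
    """
    start = raw_remark.find(name)
    if start < 0:
        raise ValueError("the name does not occur in the raw remark")
    return [
        len(raw_remark),                      # total length before segmentation
        start,                                # characters before the name
        len(raw_remark) - start - len(name),  # characters after the name
        len(name),                            # length of the candidate remark
        mined_user_habit,
        remarker_habit,
    ]


# Hypothetical habit values; the remark text follows the segmentation example
# used elsewhere in this description.
features = extract_features("he is Wang AB", "Wang AB", 0.4, 0.7)
```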

Before the rank model is used for scoring, it must be built and trained. The specific process may be as follows: obtain multiple pieces of training user remark information, used for training the rank model, that correspond to users whose real names are known, and take each group of identical training user remark information as a training candidate real name; take the pieces of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, which corresponds to a first score value; take the pieces of training user remark information corresponding to training candidate real names that are not the user's real name but have the same full pinyin as the user's real name as a second support set, which corresponds to a second score value, the first score value being greater than the second score value; extract the features of the first support set and the features of the second support set, and build and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value. Accordingly, the process of scoring the candidate real names with the largest and second-largest posterior probabilities based on the rank model may be: for each of the two input candidate real names, calculate its final score according to the score value of the support set (the first support set or the second support set) to which the multiple pieces of candidate user remark information corresponding to that candidate real name belong.
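The construction of the two support sets for one user with a known real name can be sketched as below. The sketch stops before feature extraction and model fitting, which depend on the chosen pairwise learner. The function name, the score values 1.0 and 0.0, the placeholder names and the decision to skip remarks that fall into neither set are all assumptions for illustration; the embodiment only requires the first score value to exceed the second.

```python
def build_support_sets(training_remarks, real_name, real_pinyin,
                       first_score=1.0, second_score=0.0):
    """Split training remarks for one known user into the two support sets.

    training_remarks -- iterable of (candidate_name, full_pinyin) pairs
    Returns (first_set, second_set) as lists of (candidate_name, score).
    """
    if first_score <= second_score:
        raise ValueError("the first score value must exceed the second")
    first_set, second_set = [], []
    for name, pinyin in training_remarks:
        if name == real_name:
            first_set.append((name, first_score))    # remark is the real name
        elif pinyin == real_pinyin:
            second_set.append((name, second_score))  # same pinyin, other characters
        # assumption: remarks matching neither condition are not used
    return first_set, second_set


# Placeholder names: "name-1" is the known real name, "name-2" shares its pinyin.
remarks = [("name-1", "wu xiaobo"), ("name-2", "wu xiaobo"),
           ("name-1", "wu xiaobo"), ("name-3", "zhang haibo")]
first, second = build_support_sets(remarks, "name-1", "wu xiaobo")
```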

Optionally, if the maximum corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest corrected posterior probabilities.

Optionally, if the maximum ranking weight value calculated in S305 of the embodiment corresponding to Fig. 3 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest ranking weight values.

In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The frequency of occurrence of each identical pinyin is counted according to the pinyin of each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name. When the maximum posterior probability is greater than the preset probability threshold, the candidate real name with the maximum posterior probability is taken as the optimal real name of the user to be mined, so that the user's real name can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the optimal real name can additionally be selected, based on the rank model, from the candidate real names with the largest and second-largest posterior probabilities, which further improves the accuracy of real-name identification.

Referring to Fig. 5, which is a schematic structural diagram of a data mining processing device provided in an embodiment of the present invention, the data mining processing device 1 can be applied in a server of a social network, and the data mining processing device 1 may include: an acquisition and mining module 10, a calculation module 20, and a determining module 30;

The acquisition and mining module 10 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

Specifically, the acquisition and mining module 10 can obtain multiple pieces of user remark information corresponding to the user to be mined, where the user to be mined is a user whose actual real name the server needs to analyze and identify, and the multiple pieces of user remark information are the remarks that friend users have written for the user to be mined. For example, if the user to be mined has 100 friend users and 75 of them have written a remark for the user to be mined, the information remarked by those 75 friends can be taken as the multiple pieces of user remark information. The acquisition and mining module 10 further mines at least one piece of candidate user remark information from the multiple pieces of user remark information, and takes each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name. For example, if the at least one piece of candidate user remark information contains 20 pieces reading "Wang AB", 3 pieces reading "Huang AC", 15 pieces reading "Huang AB" and 30 pieces reading "Wang AC", then "Wang AB", "Huang AC", "Huang AB" and "Wang AC" can be taken as the candidate real names.
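Grouping identical candidate remarks into candidate real names, together with their frequencies of occurrence, amounts to a simple count. The sketch below uses the counts from the example above, with the two surnames romanized as Wang and Huang; the function name is hypothetical.

```python
from collections import Counter


def candidate_real_names(candidate_remarks):
    """Map each distinct candidate remark to its frequency of occurrence."""
    return Counter(candidate_remarks)


remarks = (["Wang AB"] * 20 + ["Huang AC"] * 3 +
           ["Huang AB"] * 15 + ["Wang AC"] * 30)
counts = candidate_real_names(remarks)
```

The keys of `counts` are the candidate real names, and the values are the frequencies later used when calculating posterior probabilities.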

Further, referring also to Fig. 6, which is a schematic structural diagram of the acquisition and mining module 10 provided in an embodiment of the present invention, the acquisition and mining module 10 may include: an acquisition and screening unit 101 and a deletion and determining unit 102;

The acquisition and screening unit 101 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, and to screen out, from the multiple pieces of user remark information, first-type user remark information that meets the surname condition according to a name structure rule and a preset surname matching table;

Specifically, the name structure rule may refer to the number of characters in a normal name; for example, a normal name generally has 2 to 4 Chinese characters (a name with a single-character surname has 2 to 3 Chinese characters, and a name with a compound surname has 3 to 4 Chinese characters). Therefore, the acquisition and screening unit 101 can first perform word segmentation on the obtained multiple pieces of user remark information based on an effective word segmentation algorithm (for example, if a piece of user remark information is "he is Wang AB", it becomes "Wang AB" after word segmentation), and then screen out the segmented user remark information that contains 2 to 4 Chinese characters to obtain preliminarily screened user remark information. Next, the preliminarily screened user remark information containing 2 characters is matched against the single-character surname set in the preset surname matching table, to detect whether its first Chinese character exists in the single-character surname set; if so, that information is determined to meet the surname condition and is taken as first-type user remark information, and otherwise it is rejected. At the same time, the acquisition and screening unit 101 matches the preliminarily screened user remark information containing 4 characters against the compound surname set in the preset surname matching table, to detect whether its first two Chinese characters exist in the compound surname set; if so, that information is determined to meet the surname condition and is taken as first-type user remark information, and otherwise it is rejected. The acquisition and screening unit 101 also matches the preliminarily screened user remark information containing 3 characters against both the single-character surname set and the compound surname set, to detect whether its first Chinese character exists in the single-character surname set or its first two Chinese characters exist in the compound surname set; if either condition is met, that information is determined to meet the surname condition and is taken as first-type user remark information, and if neither is met it is rejected.
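The length check and the three surname checks can be sketched as follows, assuming the remark has already been segmented. The two surname sets are tiny hypothetical stand-ins for the preset surname matching table, the function name is an assumption, and the character range used to recognize Chinese characters covers only the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical miniature surname matching table (a real table would be larger).
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}
COMPOUND_SURNAMES = {"欧阳", "司马"}

# Name structure rule: exactly 2 to 4 Chinese characters.
TWO_TO_FOUR_HAN = re.compile(r"[\u4e00-\u9fff]{2,4}")


def meets_surname_condition(segmented_remark: str) -> bool:
    """Return True if the segmented remark qualifies as first-type information."""
    if not TWO_TO_FOUR_HAN.fullmatch(segmented_remark):
        return False
    length = len(segmented_remark)
    if length == 2:   # only a single-character surname is possible
        return segmented_remark[0] in SINGLE_SURNAMES
    if length == 4:   # only a compound surname is possible
        return segmented_remark[:2] in COMPOUND_SURNAMES
    # length == 3: either kind of surname is acceptable
    return (segmented_remark[0] in SINGLE_SURNAMES
            or segmented_remark[:2] in COMPOUND_SURNAMES)
```

With this table, "王小明" and "欧阳小明" (placeholder names) pass, while a 4-character string starting with a single-character surname is rejected, in line with the rule for 4-character remarks.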

The deletion and determining unit 102 is configured to delete, from the first-type user remark information, the user remark information that contains proper nouns and/or high-frequency words, determine the first-type user remark information remaining after the deletion as the at least one piece of candidate user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

The proper nouns may include role-specific words such as teacher, master, sir and miss, and the high-frequency words may include frequently occurring words such as tomorrow, the day after tomorrow, eat and drink water. For example, if a piece of first-type user remark information is "Teacher Wang", the deletion and determining unit 102 can determine that it contains a proper noun, and can therefore delete that piece of first-type user remark information.
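The deletion step can be sketched as a substring filter. The Chinese word lists below are assumed back-translations of the examples given in the text (teacher, master, sir, miss; tomorrow, the day after tomorrow, eat, drink water) and are not taken from the original filing; the function name and the sample names are likewise hypothetical.

```python
# Assumed back-translations of the example words; a real list would be larger.
ROLE_WORDS = ("老师", "师傅", "先生", "小姐")
HIGH_FREQUENCY_WORDS = ("明天", "后天", "吃饭", "喝水")


def drop_non_names(first_type_remarks):
    """Keep only remarks containing no role word and no high-frequency word."""
    blocked = ROLE_WORDS + HIGH_FREQUENCY_WORDS
    return [remark for remark in first_type_remarks
            if not any(word in remark for word in blocked)]


# "王老师" is "Teacher Wang"; "王小明" is a placeholder name.
candidates = drop_non_names(["王老师", "王小明", "张明天"])
```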

The calculation module 20 is configured to count the frequency of occurrence of each identical pinyin according to the pinyin of each piece of candidate user remark information, and to calculate the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;

Specifically, referring also to Fig. 7, which is a schematic structural diagram of the calculation module 20 provided in an embodiment of the present invention, the calculation module 20 may include: a pinyin acquiring unit 201, a frequency statistics unit 202, a first probability calculation unit 203, and a second probability calculation unit 204;

The pinyin acquiring unit 201 is configured to obtain the full pinyin of each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

Specifically, the pinyin acquiring unit 201 can obtain the full pinyin of each piece of candidate user remark information among the at least one piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo".

The frequency statistics unit 202 is configured to count, according to each piece of candidate user remark information, the frequency of occurrence of each identical surname pinyin and the frequency of occurrence of each identical given-name pinyin;

The first probability calculation unit 203 is configured to calculate the joint probability of each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical given-name pinyin, and the total number of pieces of candidate user remark information;

The second probability calculation unit 204 is configured to calculate the posterior probability of each candidate real name according to the frequency of occurrence of the identical full pinyin with the maximum joint probability and the frequency of occurrence of each candidate real name;

The posterior probability is calculated as: posterior probability P(candidate real name | best full pinyin) = (frequency of occurrence of the candidate real name within the best full pinyin) / (frequency of occurrence of the best full pinyin), where the best full pinyin is the identical full pinyin with the maximum joint probability. If the full pinyin of a candidate real name is not the best full pinyin, the frequency of occurrence of that candidate real name within the best full pinyin is 0.

Further, referring also to Fig. 8, which is a schematic structural diagram of the first probability calculation unit 203 provided in an embodiment of the present invention, the first probability calculation unit 203 may include: a first probability calculation subunit 2031, a second probability calculation subunit 2032, and a joint probability calculation subunit 2033;

The first probability calculation subunit 2031 is configured to calculate the first probability of each identical surname pinyin according to the frequency of occurrence of each identical surname pinyin and the total number of pieces of candidate user remark information;

The second probability calculation subunit 2032 is configured to calculate the second probability of each identical given-name pinyin according to the frequency of occurrence of each identical given-name pinyin and the total number of pieces of candidate user remark information;

The joint probability calculation subunit 2033 is configured to perform a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin;

The joint probability is calculated as: joint probability P_{full pinyin} = P_{surname pinyin} * P_{given-name pinyin}, where P_{surname pinyin} is the first probability and P_{given-name pinyin} is the second probability.

For example, suppose the at least one piece of candidate user remark information includes 30 "Wu Xiaobo" (first written form), 20 "Wu little Bo", 10 "Wu Xiaobo" (second written form), 10 "Zhang Xiaobo" and 30 "Zhang Haibo". The three "Wu" names are written with different characters but share the same pinyin, so the identical full pinyin values are "wu xiaobo", "zhang xiaobo" and "zhang haibo". The first probability calculation subunit 2031 can calculate, for the identical surname pinyin "wu", P_{surname pinyin} = frequency of occurrence of "wu" / total number of pieces of candidate user remark information = 60/100, and for the identical surname pinyin "zhang", P_{surname pinyin} = frequency of occurrence of "zhang" / total number of pieces of candidate user remark information = 40/100. The second probability calculation subunit 2032 can calculate, for the identical given-name pinyin "xiaobo", P_{given-name pinyin} = frequency of occurrence of "xiaobo" / total number of pieces of candidate user remark information = 70/100, and for the identical given-name pinyin "haibo", P_{given-name pinyin} = frequency of occurrence of "haibo" / total number of pieces of candidate user remark information = 30/100. The joint probability calculation subunit 2033 can then calculate the joint probability of the identical full pinyin "wu xiaobo" as P_{full pinyin} = P_{surname pinyin} of "wu" * P_{given-name pinyin} of "xiaobo" = 42/100, the joint probability of the identical full pinyin "zhang xiaobo" as P_{full pinyin} = P_{surname pinyin} of "zhang" * P_{given-name pinyin} of "xiaobo" = 28/100, and the joint probability of the identical full pinyin "zhang haibo" as P_{full pinyin} = P_{surname pinyin} of "zhang" * P_{given-name pinyin} of "haibo" = 12/100. The joint probability of "wu xiaobo" is the largest, so "wu xiaobo" is taken as the best full pinyin. The second probability calculation unit 204 can further calculate the posterior probabilities: P("Wu Xiaobo" (first written form) | best full pinyin "wu xiaobo") = 30/60, P("Wu little Bo" | best full pinyin "wu xiaobo") = 20/60, P("Wu Xiaobo" (second written form) | best full pinyin "wu xiaobo") = 10/60, P("Zhang Xiaobo" | best full pinyin "wu xiaobo") = 0, and P("Zhang Haibo" | best full pinyin "wu xiaobo") = 0.
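The worked example above can be reproduced with a short sketch. The pinyin of each candidate is supplied by hand rather than derived by a conversion library, the function name and data layout are assumptions, and the labels "(form 1)" to "(form 3)" are placeholders for the three differently written names that share the pinyin "wu xiaobo".

```python
from collections import Counter
from fractions import Fraction


def real_name_posteriors(candidates):
    """candidates: {name: (surname_pinyin, given_name_pinyin, frequency)}.

    Returns (best_full_pinyin, joint_probabilities, posterior_probabilities).
    """
    total = sum(freq for _, _, freq in candidates.values())
    surname_freq, given_freq, full_freq = Counter(), Counter(), Counter()
    for surname, given, freq in candidates.values():
        surname_freq[surname] += freq
        given_freq[given] += freq
        full_freq[(surname, given)] += freq

    # Joint probability = first probability * second probability.
    joint = {
        (surname, given): Fraction(surname_freq[surname], total)
        * Fraction(given_freq[given], total)
        for (surname, given) in full_freq
    }
    best = max(joint, key=joint.get)

    # Posterior = frequency of the name within the best full pinyin
    #             / frequency of the best full pinyin (0 for other pinyin).
    posterior = {
        name: Fraction(freq, full_freq[best]) if (surname, given) == best
        else Fraction(0)
        for name, (surname, given, freq) in candidates.items()
    }
    return best, joint, posterior


candidates = {
    "Wu Xiaobo (form 1)": ("wu", "xiaobo", 30),
    "Wu Xiaobo (form 2)": ("wu", "xiaobo", 20),
    "Wu Xiaobo (form 3)": ("wu", "xiaobo", 10),
    "Zhang Xiaobo": ("zhang", "xiaobo", 10),
    "Zhang Haibo": ("zhang", "haibo", 30),
}
best, joint, posterior = real_name_posteriors(candidates)
optimal = max(posterior, key=posterior.get)
```

Here `joint` holds 42/100, 28/100 and 12/100 for the three full pinyin values, `best` is `("wu", "xiaobo")`, and `optimal` is the first written form with posterior 30/60.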

The determining module 30 is configured to take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

Specifically, after the posterior probability of each candidate real name has been calculated, the determining module 30 can take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined; that is, the optimal real name can be determined to be the actual real name of the user to be mined, so that the user's real name is accurately identified. For example, suppose the candidate real names include "Wu Xiaobo" (first written form), "Wu little Bo", "Wu Xiaobo" (second written form), "Zhang Xiaobo" and "Zhang Haibo", with posterior probabilities of 30/60, 20/60, 10/60, 0 and 0 respectively. The determining module 30 can then determine "Wu Xiaobo" (first written form), which has the maximum posterior probability, as the optimal real name of the user to be mined.

Further, referring also to Fig. 9, which is a schematic structural diagram of the determining module 30 provided in an embodiment of the present invention, the determining module 30 may include: a first judging unit 301, a first determining unit 302, a correction determining unit 303, a second judging unit 304, a second determining unit 305, a weight calculation determining unit 306, a third judging unit 307, a third determining unit 308, and a model scoring determining unit 309;

The first judging unit 301 is configured to determine whether the maximum posterior probability is greater than the preset probability threshold;

The first determining unit 302 is configured to, if the first judging unit 301 determines yes, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

The correction determining unit 303 is configured to, if the first judging unit 301 determines no, correct the posterior probability of each candidate real name according to a preset weight adjustment rule, and take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;

The weight adjustment rule includes at least one of the following mapping relations: a mapping between the frequency of occurrence of a candidate real name and a correction parameter, a mapping between the weight of an identical pinyin and a correction parameter, a mapping between the character complexity of a candidate real name and a correction parameter, a mapping between the character length of a candidate real name and a correction parameter, and a mapping between the popularity of a surname and a correction parameter. For the specific implementation of the first judging unit 301, the first determining unit 302 and the correction determining unit 303, refer to S201 to S205 in the embodiment corresponding to Fig. 2 above; details are not repeated here.

The second judging unit 304 is configured to determine whether the maximum posterior probability is greater than the preset probability threshold;

The second determining unit 305 is configured to, if the second judging unit 304 determines yes, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

The weight calculation determining unit 306 is configured to, if the second judging unit 304 determines no, calculate the ranking weight value of each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined;

The user remark real-name habit value is the ratio of the number of pieces of user remark information in which a user remarked a friend with the friend's real name to the total number of pieces of user remark information that user has written for friends. For the specific implementation of the second judging unit 304, the second determining unit 305 and the weight calculation determining unit 306, refer to S301 to S305 in the embodiment corresponding to Fig. 3 above; details are not repeated here.

The third judging unit 307 is configured to determine whether the maximum posterior probability is greater than the preset probability threshold;

The third determining unit 308 is configured to, if the third judging unit 307 determines yes, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

The model score determination unit 309 selects maximum if being judged as NO for the third judging unit 307 Candidate user remark information corresponding with the candidate real name of second largest posterior probability, and to selected candidate user out Remark information carries out feature extraction, and according to the feature of extraction and preset sequence rank model to maximum and second largest The candidate real name of posterior probability scores, and the high candidate real name that will score is as the optimal real name of the user to be excavated；

Wherein, the third judging unit 307, the third determination unit 308 and the model score determination unit 309 specific implementation may refer to the S401-S405 in above-mentioned Fig. 4 corresponding embodiment, be not discussed here.

Optionally, while the first judging unit 301, the first determination unit 302 and the correction determination unit 303 are performing their operations, the second judging unit 304, the second determination unit 305, the weight calculation determination unit 306, the third judging unit 307, the third determination unit 308 and the model scoring determination unit 309 all stop working. While the second judging unit 304, the second determination unit 305 and the weight calculation determination unit 306 are performing their operations, the first judging unit 301, the first determination unit 302, the correction determination unit 303, the third judging unit 307, the third determination unit 308 and the model scoring determination unit 309 all stop working. While the third judging unit 307, the third determination unit 308 and the model scoring determination unit 309 are performing their operations, the first judging unit 301, the first determination unit 302, the correction determination unit 303, the second judging unit 304, the second determination unit 305 and the weight calculation determination unit 306 all stop working. The first judging unit 301, the second judging unit 304 and the third judging unit 307 may be one and the same judging unit; likewise, the first determination unit 302, the second determination unit 305 and the third determination unit 308 may be one and the same determination unit.

Referring again to Fig. 10, which is a schematic structural diagram of another data mining processing device 1 provided by an embodiment of the present invention, the data mining processing device 1 can be applied in a server based on a social network. The data mining processing device 1 may include the acquisition and mining module 10, the computing module 20 and the determining module 30 of the embodiment corresponding to Fig. 5 above; further, the data mining processing device 1 may also include an acquisition determining module 40, a set determining module 50 and a model training module 60;

The acquisition determining module 40 is configured to obtain a plurality of pieces of training user remark information that correspond to a user whose real name is known and that are used for training the rank model, and to take each group of identical training user remark information among them as a training candidate real name;

The set determining module 50 is configured to take the training user remark information corresponding to the training candidate real names that are the user's real name as a first support set; the first support set corresponds to a first score value;

The set determining module 50 is further configured to take the training user remark information corresponding to the training candidate real names that are not the user's real name but have the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

The model training module 60 is configured to extract the features of the first support set and the features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;

After the rank model has been established and trained by the acquisition determining module 40, the set determining module 50 and the model training module 60, the model scoring determination unit 309 in the embodiment corresponding to Fig. 9 above can calculate a final score for each of the two input candidate real names, according to the score value of the support set in the rank model (the first support set or the second support set) to which the candidate user remark information corresponding to each candidate real name belongs.

Referring again to Fig. 11, which is a schematic structural diagram of a server provided by an embodiment of the present invention: as shown in Fig. 11, the server 1000 may include at least one processor 1001 (for example a CPU), at least one network interface 1004, a user interface 1003, a memory 1005 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM, or may be a non-volatile memory, for example at least one magnetic disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 11, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a device control application.

In the server 1000 shown in Fig. 11, the network interface 1004 is mainly used to connect to a client and receive the user remark information sent by the client; the user interface 1003 is mainly used to provide an input interface for the user and obtain the data output by the user; and the processor 1001 can be used to call the device control application stored in the memory 1005 to implement the operations described in the embodiments below.

In one embodiment, when obtaining a plurality of pieces of user remark information corresponding to the user to be mined, mining at least one piece of candidate user remark information from the plurality of pieces of user remark information, and taking each group of identical candidate user remark information therein as a candidate real name, the processor 1001 specifically executes:

Obtaining a plurality of pieces of user remark information corresponding to the user to be mined, and filtering out, according to a name structure rule and a preset surname matching table, the first-type user remark information that satisfies the surname condition from the plurality of pieces of user remark information;

Deleting, from the first-type user remark information, the user remark information that contains proper nouns and/or high-frequency words, determining the first-type user remark information remaining after the deletion as at least one piece of candidate user remark information, and taking each group of identical candidate user remark information therein as a candidate real name.
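By way of illustration only, the screening and deletion steps above might be sketched in Python as follows. The surname table, the blocked-word list and the structure rule (a known surname followed by a one- or two-character given name) are hypothetical placeholders chosen for this sketch; the embodiment does not fix their contents.

```python
from collections import Counter

# Hypothetical lookup data; a real deployment would load far larger tables.
SURNAMES = {"王", "李", "张", "欧阳"}
BLOCKED_WORDS = {"老师", "经理", "快递"}  # proper nouns / high-frequency words


def matches_name_structure(remark):
    """Assumed rule: a known surname followed by a 1-2 character given name."""
    for surname in SURNAMES:
        if remark.startswith(surname) and 1 <= len(remark) - len(surname) <= 2:
            return True
    return False


def candidate_real_names(remarks):
    """Return each distinct candidate real name with its occurrence frequency."""
    # Step 1: keep the first-type remarks that satisfy the surname condition.
    first_type = [r for r in remarks if matches_name_structure(r)]
    # Step 2: delete remarks that contain a blocked word.
    candidates = [r for r in first_type
                  if not any(word in r for word in BLOCKED_WORDS)]
    # Step 3: identical remarks collapse into one candidate real name.
    return Counter(candidates)


if __name__ == "__main__":
    sample = ["王小明", "王小明", "王晓明", "李老师", "小明同学", "A快递"]
    print(candidate_real_names(sample))
```

Returning a `Counter` keeps the occurrence frequency of every candidate real name, which the later probability calculation needs.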

In one embodiment, when counting the occurrence frequency of each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculating the posterior probability of each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name, the processor 1001 specifically executes:

Obtaining the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

Counting, according to each piece of candidate user remark information, the occurrence frequency of each identical surname pinyin and the occurrence frequency of each identical given-name pinyin;

Calculating the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information;

Calculating the posterior probability of each candidate real name according to the occurrence frequency of the identical full pinyin with the maximum joint probability and the occurrence frequency of each candidate real name.
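The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the definitive calculation: the pinyin of each remark is assumed to be supplied by an upstream conversion step, the joint probability is assumed to be the product of the surname and given-name probabilities, and the posterior probability is assumed to be the ratio of a candidate's occurrence frequency to that of the best full pinyin. The function name and record layout are hypothetical.

```python
from collections import Counter


def posterior_by_candidate(records):
    """records: list of (candidate_real_name, surname_pinyin, given_name_pinyin)."""
    total = len(records)
    surname_counts = Counter(s for _, s, _ in records)
    given_counts = Counter(g for _, _, g in records)
    full_counts = Counter((s, g) for _, s, g in records)
    name_counts = Counter(name for name, _, _ in records)

    # Assumption: joint probability = P(surname pinyin) * P(given-name pinyin).
    joint = {
        (s, g): (surname_counts[s] / total) * (given_counts[g] / total)
        for (s, g) in full_counts
    }
    best_full = max(joint, key=joint.get)

    # Assumption: posterior = count(candidate) / count(best full pinyin),
    # evaluated only for candidates that share the best full pinyin.
    pinyin_of = {name: (s, g) for name, s, g in records}
    posteriors = {
        name: name_counts[name] / full_counts[best_full]
        for name in name_counts
        if pinyin_of[name] == best_full
    }
    return best_full, posteriors


if __name__ == "__main__":
    data = ([("王小明", "wang", "xiaoming")] * 3
            + [("王晓明", "wang", "xiaoming"), ("李雷", "li", "lei")])
    print(posterior_by_candidate(data))
```

Restricting the posterior to candidates that share the best full pinyin is a design choice of this sketch; candidates with a different pinyin are simply left out rather than assigned a value.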

In one embodiment, when calculating the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information, the processor 1001 specifically executes:

Calculating a first probability for each identical surname pinyin according to the occurrence frequency of each identical surname pinyin and the total amount of candidate user remark information;

Calculating a second probability for each identical given-name pinyin according to the occurrence frequency of each identical given-name pinyin and the total amount of candidate user remark information;

Performing a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.
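One plausible reading of these three steps can be written out as follows. The symbols are introduced here for illustration: N is the total amount of candidate user remark information, n(s) is the occurrence frequency of surname pinyin s, and n(g) is the occurrence frequency of given-name pinyin g. The text states only that the first and second probabilities are "calculated" into a joint probability; treating that calculation as a product is an assumption.

```latex
\[
P_1(s) = \frac{n(s)}{N}, \qquad
P_2(g) = \frac{n(g)}{N}, \qquad
P(s, g) = P_1(s)\, P_2(g)
\]
```

For instance, with N = 10 remarks, of which n(s) = 6 have surname pinyin "zhang" and n(g) = 5 have given-name pinyin "wei", this reading gives P_1 = 0.6, P_2 = 0.5 and a joint probability of 0.30 for the full pinyin "zhang wei".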

In one embodiment, when taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined, the processor 1001 specifically executes:

Judging whether the maximum posterior probability is greater than a preset probability threshold;

If judged YES, taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

If judged NO, correcting the posterior probability of each candidate real name according to a preset weight-adjustment rule, and taking the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;

The weight-adjustment rule includes at least one of the following mapping relations: between the occurrence frequency of a candidate real name and a correction parameter; between the weight of an identical pinyin and a correction parameter; between the character complexity of a candidate real name and a correction parameter; between the character length of a candidate real name and a correction parameter; and between the popularity of a surname and a correction parameter.
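The threshold test and the weight-adjustment fallback might look like the sketch below. It assumes that a correction parameter is applied multiplicatively to the posterior probability, which the text does not state, and it uses only one of the listed mapping relations (character length) with hypothetical parameter values.

```python
def length_correction(name):
    """Hypothetical mapping from character length to a correction parameter."""
    return {2: 0.9, 3: 1.2}.get(len(name), 1.0)


def pick_optimal_name(posteriors, threshold, correction=length_correction):
    """Return the optimal real name from a {candidate: posterior} mapping."""
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] > threshold:
        # Judged YES: the top candidate is confident enough on its own.
        return best
    # Judged NO: correct every posterior (assumed multiplicative) and re-rank.
    corrected = {name: p * correction(name) for name, p in posteriors.items()}
    return max(corrected, key=corrected.get)


if __name__ == "__main__":
    posteriors = {"王明": 0.5, "王小明": 0.45}
    print(pick_optimal_name(posteriors, threshold=0.7))
    print(pick_optimal_name(posteriors, threshold=0.4))
```

Passing the correction as a function makes it straightforward to combine several mapping relations, for example by multiplying their individual correction parameters.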

Alternatively, if judged NO, calculating a weighted ranking value for each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and taking the candidate real name with the maximum weighted ranking value as the optimal real name of the user to be mined;

The user real-name remark habit value is the ratio of the number of remarks that a user has given to friends in the form of real names to the total number of remarks that the user has given to friends.
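A minimal sketch of the habit value and of one possible weighted ranking follows. The habit-value ratio comes directly from the definition above; how it is combined with the posterior probability is not specified in this passage, so the combination used here (posterior multiplied by the mean habit value of the users who supplied that remark) is purely an assumption, as are the function names.

```python
def habit_value(real_name_remarks, all_remarks):
    """Ratio of a user's real-name remarks to all remarks that user has made."""
    return real_name_remarks / all_remarks if all_remarks else 0.0


def weighted_ranking(posteriors, habits_by_name):
    """habits_by_name maps each candidate real name to the habit values of
    the users whose remark produced that candidate."""
    ranking = {}
    for name, posterior in posteriors.items():
        habits = habits_by_name.get(name, [])
        mean_habit = sum(habits) / len(habits) if habits else 0.0
        # Assumed combination: posterior weighted by the mean habit value.
        ranking[name] = posterior * mean_habit
    return ranking


if __name__ == "__main__":
    posteriors = {"王小明": 0.5, "王晓明": 0.5}
    habits = {"王小明": [0.2, 0.4], "王晓明": [0.8]}
    ranking = weighted_ranking(posteriors, habits)
    print(max(ranking, key=ranking.get))
```

The intent is that a remark written by someone who habitually records friends under their real names counts for more than one written by someone who rarely does.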

Alternatively, if judged NO, selecting the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, performing feature extraction on the selected candidate user remark information, scoring the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and taking the higher-scoring candidate real name as the optimal real name of the user to be mined.
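The top-two selection and scoring flow can be sketched as below. The feature extractor and the scoring function are passed in as callables because this passage does not define the features or the form of the rank model; both are hypothetical stand-ins here.

```python
def choose_with_rank_model(posteriors, remarks_by_name, extract_features, score):
    """Score only the two candidates with the highest posterior probabilities."""
    top_two = sorted(posteriors, key=posteriors.get, reverse=True)[:2]
    scored = {
        name: score(extract_features(remarks_by_name[name]))
        for name in top_two
    }
    return max(scored, key=scored.get)


if __name__ == "__main__":
    posteriors = {"a": 0.40, "b": 0.35, "c": 0.25}
    remarks = {"a": ["a"] * 2, "b": ["b"] * 5, "c": ["c"] * 9}
    # Stand-in feature and model: the feature is the remark count and the
    # "model" returns it unchanged. A trained rank model would replace both.
    print(choose_with_rank_model(posteriors, remarks, len, float))
```

In this toy run candidate "c" is never scored, even though it has the most remarks, because only the two highest-posterior candidates reach the model.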

In one embodiment, the processor 1001 further executes:

Obtaining a plurality of pieces of training user remark information that correspond to a user whose real name is known and that are used for training the rank model, and taking each group of identical training user remark information among them as a training candidate real name;

Taking the training user remark information corresponding to the training candidate real names that are the user's real name as a first support set; the first support set corresponds to a first score value;

Taking the training user remark information corresponding to the training candidate real names that are not the user's real name but have the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

Extracting the features of the first support set and the features of the second support set, and establishing and training the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;

The first support set and the second support set in the trained rank model are used to score the input candidate real names.
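The construction of the two support sets can be illustrated as follows. The score values 1.0 and 0.5 are hypothetical; the text requires only that the first score value exceed the second. Feature extraction and the training of the rank model itself are left out because their details are not given here.

```python
FIRST_SCORE = 1.0   # hypothetical score value for the first support set
SECOND_SCORE = 0.5  # hypothetical; only FIRST_SCORE > SECOND_SCORE is required


def build_support_sets(real_name, real_pinyin, training_remarks):
    """training_remarks: list of (remark_text, full_pinyin) for one user
    whose real name is known."""
    first, second = [], []
    for text, pinyin in training_remarks:
        if text == real_name:
            # The remark is exactly the user's real name.
            first.append(text)
        elif pinyin == real_pinyin:
            # Not the real name, but it has the real name's full pinyin.
            second.append(text)
        # All other remarks belong to neither support set.
    return {"first": (first, FIRST_SCORE), "second": (second, SECOND_SCORE)}


if __name__ == "__main__":
    remarks = [("王小明", "wang xiaoming"), ("王晓明", "wang xiaoming"),
               ("老王", "lao wang"), ("王小明", "wang xiaoming")]
    print(build_support_sets("王小明", "wang xiaoming", remarks))
```

Each set and its score value would then serve as labelled training data once features have been extracted from the remarks.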

Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The above disclosure is merely a preferred embodiment of the present invention and certainly cannot be used to limit the scope of the claims of the present invention; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope of the present invention.

Claims

1. A data mining processing method, characterized by comprising:

obtaining a plurality of pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the plurality of pieces of user remark information, and taking each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin, and calculating the posterior probability of each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name; wherein the posterior probability is calculated from the occurrence frequency of the identical full pinyin having the maximum joint probability and the occurrence frequency of each candidate real name; and the joint probability of each identical full pinyin is calculated from the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information;

2. The method according to claim 1, characterized in that the obtaining a plurality of pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the plurality of pieces of user remark information, and taking each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name comprises:

obtaining a plurality of pieces of user remark information corresponding to the user to be mined, and filtering out, according to a name structure rule and a preset surname matching table, the first-type user remark information that satisfies the surname condition from the plurality of pieces of user remark information;

deleting, from the first-type user remark information, the user remark information that contains proper nouns and/or high-frequency words, determining the first-type user remark information remaining after the deletion as at least one piece of candidate user remark information, and taking each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name.

3. The method according to claim 1, characterized in that the counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin, and calculating the posterior probability of each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name comprises:

obtaining the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

counting, according to each piece of candidate user remark information, the occurrence frequency of each identical surname pinyin and the occurrence frequency of each identical given-name pinyin;

calculating the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information;

calculating the posterior probability of each candidate real name according to the occurrence frequency of the identical full pinyin with the maximum joint probability and the occurrence frequency of each candidate real name.

4. The method according to claim 3, characterized in that the calculating the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information comprises:

calculating a first probability for each identical surname pinyin according to the occurrence frequency of each identical surname pinyin and the total amount of candidate user remark information;

calculating a second probability for each identical given-name pinyin according to the occurrence frequency of each identical given-name pinyin and the total amount of candidate user remark information;

performing a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.

5. The method according to claim 1, characterized in that the taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined comprises:

if judged YES, taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

if judged NO, correcting the posterior probability of each candidate real name according to a preset weight-adjustment rule, and taking the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;

wherein the weight-adjustment rule includes at least one of the following mapping relations: between the occurrence frequency of a candidate real name and a correction parameter; between the weight of an identical full pinyin and a correction parameter; between the character complexity of a candidate real name and a correction parameter; between the character length of a candidate real name and a correction parameter; and between the popularity of a surname and a correction parameter.

6. The method according to claim 1, characterized in that the taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined comprises:

if judged NO, calculating a weighted ranking value for each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and taking the candidate real name with the maximum weighted ranking value as the optimal real name of the user to be mined;

wherein the user real-name remark habit value is the ratio of the number of remarks that a user has given to friends in the form of real names to the total number of remarks that the user has given to friends.

7. The method according to claim 1, characterized in that the taking the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined comprises:

if judged NO, selecting the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, performing feature extraction on the selected candidate user remark information, scoring the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and taking the higher-scoring candidate real name as the optimal real name of the user to be mined.

8. The method according to claim 7, characterized by further comprising:

obtaining a plurality of pieces of training user remark information that correspond to a user whose real name is known and that are used for training the rank model, and taking each group of identical training user remark information among them as a training candidate real name;

taking the training user remark information corresponding to the training candidate real names that are the user's real name as a first support set; the first support set corresponds to a first score value;

taking the training user remark information corresponding to the training candidate real names that are not the user's real name but have the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

extracting the features of the first support set and the features of the second support set, and establishing and training the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;

wherein the first support set and the second support set in the trained rank model are used to score the input candidate real names.

9. A data mining processing device, characterized by comprising:

an acquisition and mining module, configured to obtain a plurality of pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the plurality of pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;

a computing module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin, and to calculate the posterior probability of each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name; wherein the posterior probability is calculated from the occurrence frequency of the identical full pinyin having the maximum joint probability and the occurrence frequency of each candidate real name; and the joint probability of each identical full pinyin is calculated from the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information;

10. The device according to claim 9, characterized in that the acquisition and mining module comprises:

an acquisition and screening unit, configured to obtain a plurality of pieces of user remark information corresponding to the user to be mined, and to filter out, according to a name structure rule and a preset surname matching table, the first-type user remark information that satisfies the surname condition from the plurality of pieces of user remark information;

a deletion determination unit, configured to delete, from the first-type user remark information, the user remark information that contains proper nouns and/or high-frequency words, determine the first-type user remark information remaining after the deletion as at least one piece of candidate user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name.

11. The device according to claim 9, characterized in that the computing module comprises:

a pinyin acquiring unit, configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

a frequency counting unit, configured to count, according to each piece of candidate user remark information, the occurrence frequency of each identical surname pinyin and the occurrence frequency of each identical given-name pinyin;

a first probability calculation unit, configured to calculate the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total amount of candidate user remark information;

a second probability calculation unit, configured to calculate the posterior probability of each candidate real name according to the occurrence frequency of the identical full pinyin with the maximum joint probability and the occurrence frequency of each candidate real name.

12. The device according to claim 11, characterized in that the first probability calculation unit comprises:

a first probability calculation subunit, configured to calculate a first probability for each identical surname pinyin according to the occurrence frequency of each identical surname pinyin and the total amount of candidate user remark information;

a second probability calculation subunit, configured to calculate a second probability for each identical given-name pinyin according to the occurrence frequency of each identical given-name pinyin and the total amount of candidate user remark information;

a joint probability calculation subunit, configured to perform a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.

13. The device according to claim 9, characterized in that the determining module comprises:

a first judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;

a first determination unit, configured to, if the first judging unit judges YES, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

a correction determination unit, configured to, if the first judging unit judges NO, correct the posterior probability of each candidate real name according to a preset weight-adjustment rule, and take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;

14. The device according to claim 9, characterized in that the determining module comprises:

a second judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;

a second determination unit, configured to, if the second judging unit judges YES, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

a weight calculation determination unit, configured to, if the second judging unit judges NO, calculate a weighted ranking value for each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and take the candidate real name with the maximum weighted ranking value as the optimal real name of the user to be mined;

15. The device according to claim 9, characterized in that the determining module comprises:

a third judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;

a third determination unit, configured to, if the third judging unit judges YES, take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;

a model scoring determination unit, configured to, if the third judging unit judges NO, select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined.

16. The device according to claim 15, characterized by further comprising:

an acquisition determining module, configured to obtain a plurality of pieces of training user remark information that correspond to a user whose real name is known and that are used for training the rank model, and to take each group of identical training user remark information among them as a training candidate real name;

a set determining module, configured to take the training user remark information corresponding to the training candidate real names that are the user's real name as a first support set; the first support set corresponds to a first score value;

the set determining module being further configured to take the training user remark information corresponding to the training candidate real names that are not the user's real name but have the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

a model training module, configured to extract the features of the first support set and the features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;