CN106021235A

CN106021235A - Data mining processing method and device

Info

Publication number: CN106021235A
Application number: CN201610387322.5A
Authority: CN
Inventors: 黄引刚
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd; Tencent Cloud Computing Beijing Co Ltd
Priority date: 2016-06-01
Filing date: 2016-06-01
Publication date: 2016-10-12
Anticipated expiration: 2036-06-01
Also published as: CN106021235B

Abstract

The embodiments of the invention disclose a data mining processing method and device. The method comprises: obtaining multiple pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking each group of identical candidate user remark information as a candidate real name; counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating a posterior probability for each candidate real name from the occurrence counts of the identical pinyin and the occurrence count of each candidate real name; and taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined. With this data mining processing method and device, a user's real name can be accurately identified, thereby enriching the functions of a social network.

Description

Data mining processing method and device

Technical field

The present invention relates to the field of Internet technology, and in particular to a data mining processing method and device.

Background

With the development of Internet technology, more and more users participate in social networks. Before joining a social network, a user must first register, and the registered user name can be any characters the user enters; that is, the registration information need not contain the user's real name. However, security monitoring in a social network requires the user's real name in order to identify whether the user is fraudulent, and precise audience mining in a social network likewise requires the user's real name. In current social networks, a user's real name can only be provided voluntarily by the user. When the user is unwilling to provide it, the server side of the social network cannot learn the user's real name, so some functions of the social network cannot be fully realized.

Summary of the invention

The embodiments of the present invention provide a data mining processing method and device that can accurately analyze and identify a user's real name, so as to enrich the functions of a social network.

An embodiment of the present invention provides a data mining processing method, including:

obtaining multiple pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking each group of identical candidate user remark information in the at least one piece of candidate user remark information as a candidate real name;

counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.

Correspondingly, an embodiment of the present invention further provides a data mining processing device, including:

an obtaining and mining module, configured to obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information in the at least one piece of candidate user remark information as a candidate real name;

a computing module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

a determining module, configured to take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.

In the embodiments of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the occurrence count of each identical pinyin and the occurrence count of each candidate real name. Finally, the candidate real name with the largest posterior probability is taken as the optimal real name of the user to be mined. In this way, the user's real name can be accurately analyzed from user remark information even when the user has not provided it, and the various functions of the social network can then be enriched on the basis of the analyzed real name.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a data mining processing method provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention;

Fig. 3 is a schematic flowchart of yet another data mining processing method provided by an embodiment of the present invention;

Fig. 4 is a schematic flowchart of still another data mining processing method provided by an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a data mining processing device provided by an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an obtaining and mining module provided by an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of a computing module provided by an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of a first probability calculation unit provided by an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of a determining module provided by an embodiment of the present invention;

Fig. 10 is a schematic structural diagram of another data mining processing device provided by an embodiment of the present invention;

Fig. 11 is a schematic structural diagram of a server provided by an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. The described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, which is a schematic flowchart of a data mining processing method provided by an embodiment of the present invention, the method may include:

S101, obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information in the at least one piece of candidate user remark information as a candidate real name;

Specifically, a server of the social network may obtain multiple pieces of user remark information corresponding to the user to be mined. Here, the user to be mined is a user whose true real name the server needs to analyze and identify, and the multiple pieces of user remark information are the remarks that other friend users have attached to the user to be mined. For example, if the user to be mined has 100 friend users and 75 of them have attached a remark to the user to be mined, the remarks entered by these 75 friends may be used as the multiple pieces of user remark information. The server then mines at least one piece of candidate user remark information from the multiple pieces of user remark information, and takes each group of identical candidate user remark information as a candidate real name. For example, if the at least one piece of candidate user remark information contains 20 pieces reading "Wang AB", 3 pieces reading "Huang AC", 15 pieces reading "Huang AB", and 30 pieces reading "Wang AC", then "Wang AB", "Huang AC", "Huang AB", and "Wang AC" may all be taken as candidate real names.
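
The grouping of identical candidate remarks into candidate real names can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `candidate_real_names` is hypothetical, and the sample data reproduces the four names and counts from the example above, with the surnames rendered as Wang and Huang.

```python
from collections import Counter


def candidate_real_names(candidate_remarks):
    """Group identical candidate remarks.

    Each distinct remark string becomes one candidate real name,
    mapped to the number of friends who used exactly that remark.
    """
    return Counter(candidate_remarks)


# Sample data mirroring the example: 20 + 3 + 15 + 30 candidate remarks.
remarks = (
    ["Wang AB"] * 20 + ["Huang AC"] * 3 + ["Huang AB"] * 15 + ["Wang AC"] * 30
)
counts = candidate_real_names(remarks)
```

The occurrence count of each candidate real name computed here is the quantity that the later posterior probability calculation relies on.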

The specific process by which the server mines at least one piece of candidate user remark information from the multiple pieces of user remark information may be as follows: obtain the multiple pieces of user remark information corresponding to the user to be mined, and screen out from them, according to a name structure rule and a preset surname matching table, the first-class user remark information that satisfies the surname condition; then delete from the first-class user remark information any remark that contains a proper noun and/or a high-frequency word, and determine the first-class user remark information remaining after deletion as the at least one piece of candidate user remark information. Here, proper nouns may include role or title words such as teacher, master, sir, and Miss, and high-frequency words may include commonly occurring words such as tomorrow, the day after tomorrow, eat, and drink water. For example, if a piece of first-class user remark information is "Teacher Wang", it can be determined to contain a proper noun, and it may therefore be deleted.

Here, the name structure rule may refer to the number of characters in a normal name: a normal name generally has 2 to 4 Chinese characters (a name with a single-character surname has 2 to 3 Chinese characters, and a name with a compound, two-character surname has 3 to 4 Chinese characters). Accordingly, the specific process of screening out, according to the name structure rule and the preset surname matching table, the first-class user remark information that satisfies the surname condition may be as follows. The server first applies an effective word segmentation algorithm to the obtained pieces of user remark information (for example, if a piece of user remark information is "he is Wang AB", the segmented user remark information becomes "Wang AB"), and then selects the segmented user remark information that contains 2 to 4 Chinese characters, obtaining preliminarily screened user remark information. Next, the preliminarily screened user remark information containing 2 characters is matched against the single-character surname set in the preset surname matching table, to detect whether its first Chinese character is present in the single-character surname set; if so, it is determined to satisfy the surname condition and is taken as first-class user remark information, and otherwise it is rejected. At the same time, the preliminarily screened user remark information containing 4 characters is matched against the compound surname set in the preset surname matching table, to detect whether its first two Chinese characters are present in the compound surname set; if so, it is determined to satisfy the surname condition and is taken as first-class user remark information, and otherwise it is rejected. Also at the same time, the preliminarily screened user remark information containing 3 characters is matched against both the single-character surname set and the compound surname set, to detect whether its first Chinese character is present in the single-character surname set or its first two Chinese characters are present in the compound surname set; if either condition is met, it is determined to satisfy the surname condition and is taken as first-class user remark information, and if neither is met it is rejected.
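
The screening steps above can be sketched as follows. This is a minimal sketch under stated assumptions: word segmentation is assumed to have already been applied to each remark, and the surname sets, role words, and high-frequency words are small illustrative stand-ins (including the assumed Chinese forms of teacher, master, sir, Miss, tomorrow, the day after tomorrow, eat, and drink water) rather than the preset tables of the patent. All function and constant names are hypothetical.

```python
# Illustrative stand-ins for the preset surname matching table.
SINGLE_SURNAMES = {"王", "张", "吴", "黄"}
COMPOUND_SURNAMES = {"欧阳", "司马"}
# Illustrative role words (proper nouns) and high-frequency words.
ROLE_WORDS = {"老师", "师傅", "先生", "小姐"}
HIGH_FREQ_WORDS = {"明天", "后天", "吃饭", "喝水"}


def is_chinese(text):
    """True if every character lies in the CJK Unified Ideographs block."""
    return all("\u4e00" <= ch <= "\u9fff" for ch in text)


def meets_surname_condition(remark):
    """Apply the name structure rule (2 to 4 characters) and surname match."""
    n = len(remark)
    if not 2 <= n <= 4 or not is_chinese(remark):
        return False
    if n == 2:  # only a single-character surname is possible
        return remark[0] in SINGLE_SURNAMES
    if n == 4:  # only a compound surname is possible
        return remark[:2] in COMPOUND_SURNAMES
    # 3 characters: either kind of surname is acceptable
    return remark[0] in SINGLE_SURNAMES or remark[:2] in COMPOUND_SURNAMES


def contains_excluded_word(remark):
    """True if the remark contains a role word or a high-frequency word."""
    return any(word in remark for word in ROLE_WORDS | HIGH_FREQ_WORDS)


def screen_candidates(segmented_remarks):
    """Keep first-class remarks, then drop those with excluded words."""
    first_class = [r for r in segmented_remarks if meets_surname_condition(r)]
    return [r for r in first_class if not contains_excluded_word(r)]


sample = ["王老师", "张小波", "欧阳明天", "小波", "欧阳小波", "王小波波"]
candidates = screen_candidates(sample)
```

In this sample, "王老师" (Teacher Wang) passes the surname check but is deleted for containing a role word, while "小波" and "王小波波" fail the surname check outright.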

S102, count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

Specifically, the server may obtain the full pinyin corresponding to each piece of candidate user remark information in the at least one piece of candidate user remark information, where the full pinyin includes a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", in which the surname pinyin is "zhang" and the given-name pinyin is "xiaobo". The server then counts, across the candidate user remark information, the occurrence count of each identical surname pinyin and the occurrence count of each identical given-name pinyin. For example, if the at least one piece of candidate user remark information includes 20 pieces reading "Zhang Xiaobo", 25 pieces reading "Zhang Xiaobo" written with different characters but the same pinyin, 10 pieces reading "Wang Xiaobo", and 5 pieces reading "Zhang Haibo", then the identical surname pinyin are "zhang" and "wang" and the identical given-name pinyin are "xiaobo" and "haibo". It follows that the occurrence count of the surname pinyin "zhang" is 50, the occurrence count of the surname pinyin "wang" is 10, the occurrence count of the given-name pinyin "xiaobo" is 55, and the occurrence count of the given-name pinyin "haibo" is 5. The server then calculates the joint probability of each identical full pinyin according to the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin, and the total amount of candidate user remark information, and calculates the posterior probability of each candidate real name according to the occurrence count of the identical full pinyin with the largest joint probability and the occurrence count of each candidate real name.
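
The counting step can be sketched as follows. This is a minimal sketch that assumes each candidate remark has already been converted to a (surname pinyin, given-name pinyin) pair by some pinyin conversion step not shown here; the function name `pinyin_counts` is hypothetical, and the sample pairs are chosen to match the counts stated in the example (50, 10, 55, and 5).

```python
from collections import Counter


def pinyin_counts(pinyin_pairs):
    """Count occurrences of each surname pinyin and each given-name pinyin.

    pinyin_pairs: iterable of (surname_pinyin, given_name_pinyin) tuples,
    one per piece of candidate user remark information.
    """
    surname_freq, given_freq = Counter(), Counter()
    for surname, given in pinyin_pairs:
        surname_freq[surname] += 1
        given_freq[given] += 1
    return surname_freq, given_freq


# 20 + 25 remarks read "zhang xiaobo", 10 read "wang xiaobo", 5 read "zhang haibo".
pairs = (
    [("zhang", "xiaobo")] * 45 + [("wang", "xiaobo")] * 10 + [("zhang", "haibo")] * 5
)
surname_freq, given_freq = pinyin_counts(pairs)
```

Counting by pinyin rather than by written form is what lets homophonous spellings of the same name reinforce one another.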

The specific process of calculating the joint probability of each identical full pinyin according to the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin, and the total amount of candidate user remark information is as follows: calculate the first probability of each identical surname pinyin according to its occurrence count and the total amount of candidate user remark information; calculate the second probability of each identical given-name pinyin according to its occurrence count and the total amount of candidate user remark information; and combine each first probability with each second probability to obtain the joint probability of each identical full pinyin.

The joint probability is calculated as P_{full pinyin} = P_{surname pinyin} × P_{given-name pinyin}, where P_{surname pinyin} is the first probability and P_{given-name pinyin} is the second probability. The posterior probability is calculated as P(candidate real name | optimal full pinyin) = (occurrence count of the candidate real name within the optimal full pinyin) / (occurrence count of the optimal full pinyin), where the optimal full pinyin is the identical full pinyin with the largest joint probability. If the full pinyin of a candidate real name is not the optimal full pinyin, the occurrence count of that candidate real name within the optimal full pinyin is 0. For example, suppose the at least one piece of candidate user remark information includes three differently written names that all read "wu xiaobo", namely 30 pieces of "Wu Xiaobo" (first written form), 20 pieces of "Wu Xiaobo" (second written form), and 10 pieces of "Wu Xiaobo" (third written form), together with 10 pieces of "Zhang Xiaobo" and 30 pieces of "Zhang Haibo". The identical full pinyin are then "wu xiaobo", "zhang xiaobo", and "zhang haibo". For the surname pinyin "wu", P_{surname pinyin} = occurrence count of "wu" / total amount of candidate user remark information = 60/100; for the surname pinyin "zhang", P_{surname pinyin} = occurrence count of "zhang" / total amount = 40/100; for the given-name pinyin "xiaobo", P_{given-name pinyin} = occurrence count of "xiaobo" / total amount = 70/100; and for the given-name pinyin "haibo", P_{given-name pinyin} = occurrence count of "haibo" / total amount = 30/100. The joint probability of the full pinyin "wu xiaobo" is therefore P_{full pinyin} = (60/100) × (70/100) = 42/100, that of "zhang xiaobo" is (40/100) × (70/100) = 28/100, and that of "zhang haibo" is (40/100) × (30/100) = 12/100. The joint probability of "wu xiaobo" is the largest, so "wu xiaobo" is taken as the optimal full pinyin. The posterior probabilities then follow: P(Wu Xiaobo, first written form | optimal full pinyin "wu xiaobo") = 30/60, P(Wu Xiaobo, second written form | optimal full pinyin "wu xiaobo") = 20/60, P(Wu Xiaobo, third written form | optimal full pinyin "wu xiaobo") = 10/60, P(Zhang Xiaobo | optimal full pinyin "wu xiaobo") = 0, and P(Zhang Haibo | optimal full pinyin "wu xiaobo") = 0.

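The worked example above can be reproduced with a short sketch. This is a minimal illustration, not the patented implementation: the function name `score_candidates` and the labels "#1" to "#3" for the three written forms of the name read "wu xiaobo" are hypothetical, and the pinyin of each candidate real name is assumed to be supplied by a separate conversion step. Exact fractions are used so the results can be compared directly with the figures in the text.

```python
from collections import Counter
from fractions import Fraction


def score_candidates(name_counts, pinyin_of):
    """Return (joint, best_full_pinyin, posterior) for candidate real names.

    name_counts: candidate real name -> occurrence count
    pinyin_of:   candidate real name -> (surname pinyin, given-name pinyin)
    """
    total = sum(name_counts.values())
    surname_freq, given_freq, full_freq = Counter(), Counter(), Counter()
    for name, n in name_counts.items():
        surname, given = pinyin_of[name]
        surname_freq[surname] += n
        given_freq[given] += n
        full_freq[(surname, given)] += n

    # Joint probability = first probability * second probability.
    joint = {
        (s, g): Fraction(surname_freq[s], total) * Fraction(given_freq[g], total)
        for (s, g) in full_freq
    }
    best = max(joint, key=joint.get)  # optimal full pinyin

    # Names whose full pinyin is not the optimal one get a count of 0.
    posterior = {
        name: Fraction(n if pinyin_of[name] == best else 0, full_freq[best])
        for name, n in name_counts.items()
    }
    return joint, best, posterior


name_counts = {
    "Wu Xiaobo #1": 30, "Wu Xiaobo #2": 20, "Wu Xiaobo #3": 10,
    "Zhang Xiaobo": 10, "Zhang Haibo": 30,
}
pinyin_of = {
    "Wu Xiaobo #1": ("wu", "xiaobo"), "Wu Xiaobo #2": ("wu", "xiaobo"),
    "Wu Xiaobo #3": ("wu", "xiaobo"), "Zhang Xiaobo": ("zhang", "xiaobo"),
    "Zhang Haibo": ("zhang", "haibo"),
}
joint, best, posterior = score_candidates(name_counts, pinyin_of)
optimal_name = max(posterior, key=posterior.get)
```

Note that the joint probability multiplies the two marginal probabilities, which treats surname pinyin and given-name pinyin as independent; this follows the formula as stated in the text.
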
S103, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, after the server has calculated the posterior probability of each candidate real name, it may take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined; that is, the optimal real name may be determined to be the true real name of the user to be mined, so that the user's real name is accurately identified. For example, suppose the candidate real names include three differently written names that all read "wu xiaobo", namely "Wu Xiaobo" (first written form), "Wu Xiaobo" (second written form), and "Wu Xiaobo" (third written form), together with "Zhang Xiaobo" and "Zhang Haibo", with posterior probabilities of 30/60, 20/60, 10/60, 0, and 0 respectively. The first written form of "Wu Xiaobo", which has the largest posterior probability, may then be determined as the optimal real name of the user to be mined.

Referring to Fig. 2, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention, the method may include:

S201, obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information in the at least one piece of candidate user remark information as a candidate real name;

S202, count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;

For the specific implementation of steps S201 to S202, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.

S203, determine whether the largest posterior probability is greater than a preset probability threshold;

Specifically, after the server has calculated the posterior probability of each candidate real name, it may further determine whether the largest posterior probability is greater than the preset probability threshold.

S204, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, if the determination in S203 is yes, the largest posterior probability is sufficiently credible. The candidate real name with the largest posterior probability may therefore be taken as the optimal real name of the user to be mined, to ensure that the optimal real name is the true real name of the user to be mined.
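
The branch between S203 and S204 can be sketched as follows. This is a minimal sketch: the function name `pick_if_confident` and the threshold values are hypothetical, since the text does not state a concrete threshold. A return value of `None` signals that the largest posterior probability is not above the threshold and that a fallback step is needed.

```python
from fractions import Fraction


def pick_if_confident(posterior, threshold):
    """Return the top candidate real name if its posterior exceeds threshold.

    posterior: candidate real name -> posterior probability
    Returns None when the largest posterior is not strictly greater than
    the threshold, so the caller can apply a fallback rule instead.
    """
    best = max(posterior, key=posterior.get)
    return best if posterior[best] > threshold else None


posterior = {
    "name A": Fraction(30, 60),
    "name B": Fraction(20, 60),
    "name C": Fraction(10, 60),
}
accepted = pick_if_confident(posterior, Fraction(2, 5))   # 1/2 > 2/5
deferred = pick_if_confident(posterior, Fraction(1, 2))   # 1/2 is not > 1/2
```

The comparison is strict, matching the wording "greater than" in S203.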

S205, correct the posterior probability of each candidate real name according to a preset weight-adjustment rule, and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined;

Specifically, if the determination in S203 is no, the server may correct the posterior probability of each candidate real name according to the preset weight-adjustment rule, and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined. The weight-adjustment rule includes at least one of the following mapping relations: the mapping between the occurrence count of a candidate real name and a correction parameter, the mapping between the weight of an identical full pinyin and a correction parameter, the mapping between the character complexity of a candidate real name and a correction parameter, the mapping between the character length of a candidate real name and a correction parameter, and the mapping between the popularity of a surname and a correction parameter.

The mapping between the occurrence count of a candidate real name and a correction parameter is a mapping between multiple different occurrence-count ranges and multiple different correction parameters. A larger occurrence-count range corresponds to a larger correction parameter, and the correction parameter for an occurrence-count range below a count threshold is negative. For example, if the occurrence count of candidate real name A is greater than that of candidate real name B, the correction parameter for candidate real name A is larger, so the posterior probability of candidate real name A is increased by a larger amount; and if the occurrence count of candidate real name C is below the count threshold, the posterior probability of candidate real name C is reduced.

The mapping between the weight of an identical full pinyin and a correction parameter is a mapping between multiple different weight ranges and multiple different correction parameters. A larger weight range corresponds to a larger correction parameter, and the correction parameter for a weight range below a weight threshold may be negative. For example, the larger the share of the total amount of user remark information that an identical full pinyin accounts for, the larger the weight of that full pinyin and the larger its correction parameter, so the posterior probabilities of the candidate real names belonging to that full pinyin can all be raised.

The mapping between the character complexity of a candidate real name and a correction parameter is a mapping between multiple different character complexities and multiple different correction parameters, where a larger character complexity corresponds to a larger correction parameter. For example, if a candidate real name contains a Chinese character that is hard to write and uncommon (that is, of high character complexity), the candidate real name may correspond to a larger correction parameter, so its posterior probability can be raised substantially.

The mapping between the character length of a candidate real name and a correction parameter is a mapping between multiple different character lengths and multiple different correction parameters, where a longer character length corresponds to a larger correction parameter. For example, if candidate real name A is longer than candidate real name B, candidate real name A may correspond to a larger correction parameter, so its posterior probability can be raised by a larger amount.

The mapping between the popularity of a surname and a correction parameter is a mapping between multiple different surname popularity levels and multiple different correction parameters. The more common the surname, the larger its correction parameter, and the correction parameter for a surname below a popularity threshold may be negative. For example, the correction parameter for the surname "Wang" is larger than that for the surname "Ouyang".

The server may therefore correct the posterior probability of each candidate real name according to one mapping relation in the weight-adjustment rule or a combination of several (the correction may either increase or decrease a posterior probability), and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined.
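
A weight-adjustment rule of this kind can be sketched as follows. This is only an illustrative sketch under several assumptions: the correction is applied additively, only two of the five mapping relations (occurrence count and character length) are shown, and every range boundary and correction value is invented for the example. None of the names or numbers come from the patent.

```python
def count_correction(count, low=5, high=20):
    """Map an occurrence count to a correction parameter (assumed values).

    Counts below `low` (the count threshold) receive a negative correction;
    larger ranges receive larger corrections.
    """
    if count < low:
        return -0.10
    if count >= high:
        return 0.05
    return 0.0


def length_correction(length):
    """Map a name's character length to a correction (longer gets more)."""
    return 0.02 * (length - 2)


def apply_weight_rules(posterior, counts, lengths):
    """Return corrected posteriors: original value plus both corrections."""
    return {
        name: p + count_correction(counts[name]) + length_correction(lengths[name])
        for name, p in posterior.items()
    }


posterior = {"A": 0.40, "B": 0.38}
counts = {"A": 4, "B": 25}     # A is below the assumed count threshold
lengths = {"A": 2, "B": 3}     # B has one more character than A
corrected = apply_weight_rules(posterior, counts, lengths)
optimal_name = max(corrected, key=corrected.get)
```

In this sample the correction reverses the original ranking: A starts ahead but is penalized for its low occurrence count, while B gains from both mappings.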

In the embodiments of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information, and the posterior probability of each candidate real name is calculated from the occurrence count of each identical pinyin and the occurrence count of each candidate real name. When the largest posterior probability is greater than the preset probability threshold, the candidate real name with the largest posterior probability may be taken as the optimal real name of the user to be mined, so that the user's real name can be accurately analyzed from user remark information even when the user has not provided it, and the various functions of the social network can be enriched on the basis of the analyzed real name. When the largest posterior probability is less than or equal to the preset probability threshold, the posterior probability of each candidate real name may be further corrected according to the preset weight-adjustment rule, and the candidate real name with the largest corrected posterior probability is taken as the optimal real name of the user to be mined, which further improves the accuracy of real-name identification.

Refer to Fig. 3, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention. The method may include:

S301: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece as candidate real names;

S302: according to the pinyin corresponding to each piece of candidate user remark information, count the occurrence frequency of each identical pinyin, and calculate the posterior probability corresponding to each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name;

For the specific implementation of steps S301 to S302, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.

S303: judge whether the largest posterior probability is greater than the preset probability threshold;

S304: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, if the judgment in S303 is yes, the largest posterior probability is sufficiently credible. Therefore, the candidate real name with the largest posterior probability may be taken as the optimal real name of the user to be mined, so as to ensure that the optimal real name is the true real name of the user to be mined.

S305: calculate the ranking weight value corresponding to each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined;

Specifically, if the judgment in S303 is no, the server may obtain the remark attributes of the user corresponding to each piece of candidate user remark information (that is, the user who wrote the remark for the user to be mined). The remark attributes of a user include the number of pieces of user remark information in which that user remarked a friend with a real name, and the total number of pieces of user remark information that user has written for friends. The server then calculates, from the remark attributes, the user remark real-name habit value corresponding to each piece of candidate user remark information, where the user remark real-name habit value is the ratio of the number of real-name remarks a user has written for friends to the total number of remarks that user has written for friends. For example, suppose the user corresponding to a certain piece of candidate user remark information (that is, the user who remarked the user to be mined) is user A. If user A has written 100 pieces of user remark information for other people, and 70 of these 100 pieces are true real names, the user remark real-name habit value of user A is 70/100. After calculating the user remark real-name habit value corresponding to each piece of candidate user remark information, the server may calculate the ranking weight value corresponding to each candidate real name according to these habit values and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined.

The specific process of calculating the ranking weight value corresponding to each candidate real name, according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, may be as follows. Taking candidate real name A as an example, the server may determine the multiple pieces of candidate user remark information corresponding to candidate real name A (the content of each of these pieces is candidate real name A) as multiple pieces of target candidate user remark information, and then calculate the average of the user remark real-name habit values corresponding to these pieces of target candidate user remark information. The average is then added to the posterior probability corresponding to candidate real name A to obtain the corresponding ranking weight value; alternatively, a certain coefficient may be added to the average and the result multiplied by the posterior probability corresponding to candidate real name A to obtain the corresponding ranking weight value. The ranking weight values of the other candidate real names are calculated on the same principle.
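The habit value and the two ways of combining it with the posterior probability can be sketched as below. The function names `habit_value` and `ranking_weight` and the `coefficient` parameter are hypothetical; the source leaves the value of the coefficient open, so none is suggested here.

```python
from fractions import Fraction


def habit_value(real_name_remarks, total_remarks):
    """Share of a remarking user's remarks that are real names, e.g. 70/100."""
    return Fraction(real_name_remarks, total_remarks)


def ranking_weight(habit_values, posterior, coefficient=None):
    """Average the habit values of the users who wrote this candidate real name,
    then combine the average with the posterior probability.

    coefficient is None: average + posterior.
    coefficient given:   (average + coefficient) * posterior.
    """
    average = sum(habit_values) / len(habit_values)
    if coefficient is None:
        return average + posterior
    return (average + coefficient) * posterior
```

For instance, if two users with habit values 7/10 and 1/2 both wrote candidate real name A and its posterior probability is 1/2, the additive form yields 3/5 + 1/2 and the multiplicative form with a coefficient of 1 yields (3/5 + 1) * 1/2.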

Optionally, if the largest corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the ranking weight values corresponding to the corrected posterior probabilities may be calculated using the calculation principle of S305, so as to determine the optimal real name more accurately.

Optionally, if the largest ranking weight value calculated in S305 is still less than the preset probability threshold, the ranking weight values may be corrected using the calculation principle of S205 in the embodiment corresponding to Fig. 2 above, so as to determine the optimal real name more accurately.

In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical pieces of candidate user remark information among them are each taken as a candidate real name. According to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin is counted, and the posterior probability corresponding to each candidate real name is calculated from the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name. When the largest posterior probability is greater than the preset probability threshold, the candidate real name with the largest posterior probability may be taken as the optimal real name of the user to be mined, so that the real name of a user can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can in turn be enriched based on the inferred real name. When the largest posterior probability is less than or equal to the preset probability threshold, the ranking weight value corresponding to each candidate real name may further be calculated according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and the candidate real name with the largest ranking weight value is taken as the optimal real name of the user to be mined, which further improves the accuracy of real-name identification.

Refer to Fig. 4, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention. The method may include:

S401: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece as candidate real names;

S402: according to the pinyin corresponding to each piece of candidate user remark information, count the occurrence frequency of each identical pinyin, and calculate the posterior probability corresponding to each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name;

For the specific implementation of steps S401 to S402, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.

S403: judge whether the largest posterior probability is greater than the preset probability threshold;

S404: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, if the judgment in S403 is yes, the largest posterior probability is sufficiently credible. Therefore, the candidate real name with the largest posterior probability may be taken as the optimal real name of the user to be mined, so as to ensure that the optimal real name is the true real name of the user to be mined.

S405: select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model, and take the candidate real name with the higher score as the optimal real name of the user to be mined;

Specifically, if the judgment in S403 is no, the server may select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the two candidate real names according to the extracted features and the preset rank model, and take the candidate real name with the higher score as the optimal real name of the user to be mined. The rank model may be a pairwise rank model. The features may include: the total character length of the user remark information, before word segmentation, from which the candidate user remark information was obtained; the character length before the name; the character length after the name; the total character length of the candidate user remark information; the user remark real-name habit value of the user to be mined; and the user remark real-name habit value of the user corresponding to the candidate user remark information (that is, the user who remarked the user to be mined).

Before the rank model is used for scoring, it must be established and trained. The specific process of establishing and training the rank model may be as follows: obtain multiple pieces of training user remark information, corresponding to a user whose real name is known, for training the rank model, and take identical pieces of training user remark information among them as training candidate real names; take the pieces of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, which corresponds to a first scoring value; take the pieces of training user remark information corresponding to training candidate real names that are not the user's real name but have the same full pinyin as the user's real name as a second support set, which corresponds to a second scoring value, the first scoring value being greater than the second scoring value; extract the features of the first support set and the features of the second support set, and establish and train the rank model according to the features of the first support set and the first scoring value, and the features of the second support set and the second scoring value. Accordingly, the process of scoring the candidate real names with the largest and second-largest posterior probabilities based on the rank model may be: calculate the final score of each of the two candidate real names according to the scoring value of the support set (the first support set or the second support set) to which the multiple pieces of candidate user remark information corresponding to each of the two input candidate real names belong.
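The construction of the two support sets can be sketched as below. The scoring values, the `pinyin_of` lookup table, and the function names are assumptions made for illustration; the source only requires the first scoring value to exceed the second. The `pairwise_examples` helper shows one possible way to form preference pairs for a pairwise rank model and is not taken from the source.

```python
FIRST_SCORE = 1.0   # assumed value; only FIRST_SCORE > SECOND_SCORE is required
SECOND_SCORE = 0.0


def build_support_sets(training_remarks, real_name, pinyin_of):
    """Split training remarks into the first and second support sets.

    first:  remarks equal to the known real name.
    second: remarks that differ from the real name but share its full pinyin.
    Remarks with a different full pinyin belong to neither set.
    """
    real_pinyin = pinyin_of[real_name]
    first = [r for r in training_remarks if r == real_name]
    second = [
        r for r in training_remarks
        if r != real_name and pinyin_of.get(r) == real_pinyin
    ]
    return first, second


def pairwise_examples(first, second):
    """Each (preferred, other) pair says the first-set remark should outrank the other."""
    return [(a, b) for a in first for b in second]
```

In the tests for this sketch, `name_A` and `name_B` stand for two names written with different characters that share one full pinyin, and `name_C` has a different full pinyin.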

Optionally, if the largest corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the optimal real name may be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest corrected posterior probabilities.

Optionally, if the largest ranking weight value calculated in S305 of the embodiment corresponding to Fig. 3 above is still less than the preset probability threshold, the optimal real name may be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest ranking weight values.

In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical pieces of candidate user remark information among them are each taken as a candidate real name. According to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin is counted, and the posterior probability corresponding to each candidate real name is calculated from the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name. When the largest posterior probability is greater than the preset probability threshold, the candidate real name with the largest posterior probability may be taken as the optimal real name of the user to be mined, so that the real name of a user can be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can in turn be enriched based on the inferred real name. When the largest posterior probability is less than or equal to the preset probability threshold, the optimal real name may further be selected, based on the rank model, from the candidate real names with the largest and second-largest posterior probabilities, which further improves the accuracy of real-name identification.

Refer to Fig. 5, which is a schematic structural diagram of a data mining processing device provided by an embodiment of the present invention. The data mining processing device 1 may be applied in a server based on a social network, and may include: an acquisition and mining module 10, a calculation module 20, and a determination module 30;

The acquisition and mining module 10 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece as candidate real names;

Specifically, the acquisition and mining module 10 may obtain multiple pieces of user remark information corresponding to the user to be mined, where the user to be mined is a user whose true real name the server needs to identify, and the multiple pieces of user remark information are the remarks that other friend users have written for the user to be mined. For example, if the user to be mined has 100 friend users and 75 of them have remarked the user to be mined, the information remarked by these 75 friends may be taken as the multiple pieces of user remark information. The acquisition and mining module 10 further mines at least one piece of candidate user remark information from the multiple pieces of user remark information, and takes identical pieces of candidate user remark information among them as candidate real names. For example, if among the at least one piece of candidate user remark information 20 pieces are "Wang AB", 3 pieces are "Huang AC", 15 pieces are "Huang AB", and 30 pieces are "Wang AC", then "Wang AB", "Huang AC", "Huang AB", and "Wang AC" may all be taken as candidate real names.
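Grouping identical candidate remarks into candidate real names, together with their occurrence frequencies, can be sketched with a counter. The function name `candidate_real_names` is hypothetical; the counts reproduce the example above.

```python
from collections import Counter


def candidate_real_names(candidate_remarks):
    """Map each distinct candidate remark to its occurrence frequency."""
    return Counter(candidate_remarks)


remarks = ["Wang AB"] * 20 + ["Huang AC"] * 3 + ["Huang AB"] * 15 + ["Wang AC"] * 30
counts = candidate_real_names(remarks)
```

The keys of `counts` are the candidate real names, and the values are the frequencies later used in the posterior probability calculation.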

Further, refer also to Fig. 6, which is a schematic structural diagram of an acquisition and mining module 10 provided by an embodiment of the present invention. The acquisition and mining module 10 may include: an acquisition and screening unit 101 and a deletion and determination unit 102;

The acquisition and screening unit 101 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, and to screen out, from the multiple pieces of user remark information, first-class user remark information that meets the surname condition according to a name structure rule and a preset surname matching table;

Specifically, the name structure rule may refer to the number of characters in a normal name; for example, a normal name generally has 2 to 4 Chinese characters (a name with a single-character surname has 2 to 3 Chinese characters, and a name with a compound surname has 3 to 4 Chinese characters). Therefore, the acquisition and screening unit 101 may first segment the obtained pieces of user remark information using an effective word segmentation algorithm (for example, if the user remark information is "he is Wang AB", the segmented user remark information becomes "Wang AB"), and then select the segmented user remark information that contains 2 to 4 Chinese characters to obtain preliminarily screened user remark information. Next, the preliminarily screened user remark information containing 2 characters is matched against the single-character surname set in the preset surname matching table, to detect whether its first Chinese character is in the single-character surname set; if so, the 2-character preliminarily screened user remark information is determined to meet the surname condition and is taken as first-class user remark information; otherwise it is discarded. At the same time, the acquisition and screening unit 101 matches the preliminarily screened user remark information containing 4 characters against the compound surname set in the preset surname matching table, to detect whether its first two Chinese characters are in the compound surname set; if so, the 4-character preliminarily screened user remark information is determined to meet the surname condition and is taken as first-class user remark information; otherwise it is discarded. The acquisition and screening unit 101 also matches the preliminarily screened user remark information containing 3 characters against both the single-character surname set and the compound surname set, to detect whether its first Chinese character is in the single-character surname set or its first two Chinese characters are in the compound surname set; if either condition is met, the 3-character preliminarily screened user remark information is determined to meet the surname condition and is taken as first-class user remark information; if neither is met, it is discarded.
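The surname screening rule can be sketched as below, assuming word segmentation has already been applied upstream. The two surname sets are tiny illustrative stand-ins for the preset surname matching table, and the function names are hypothetical.

```python
# Illustrative stand-ins for the preset surname matching table.
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}  # Wang, Huang, Zhang, Wu
COMPOUND_SURNAMES = {"欧阳"}                # Ouyang


def meets_surname_condition(remark):
    """Apply the 2-, 3-, and 4-character surname checks to a segmented remark."""
    n = len(remark)
    if n == 2:
        return remark[0] in SINGLE_SURNAMES
    if n == 3:
        return remark[0] in SINGLE_SURNAMES or remark[:2] in COMPOUND_SURNAMES
    if n == 4:
        return remark[:2] in COMPOUND_SURNAMES
    return False  # outside the 2-to-4 character name structure rule


def first_class_remarks(segmented_remarks):
    """Keep only the remarks that meet the surname condition."""
    return [r for r in segmented_remarks if meets_surname_condition(r)]
```

Note that a 4-character remark is checked only against the compound surname set, so a 4-character string beginning with a single-character surname is discarded, as the rule above specifies.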

The deletion and determination unit 102 is configured to delete, from the first-class user remark information, the user remark information that contains proper nouns and/or high-frequency words, determine the remaining first-class user remark information as the at least one piece of candidate user remark information, and take identical pieces of candidate user remark information among the at least one piece as candidate real names;

The proper nouns may include role-specific words such as teacher, master, mister, and miss, and the high-frequency words may include frequently occurring words such as tomorrow, the day after tomorrow, eat, and drink water. For example, if a certain piece of first-class user remark information is "Teacher Wang", the deletion and determination unit 102 may determine that it contains a proper noun and may therefore delete it.
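The deletion step can be sketched as a simple substring filter. The word lists are assumed Chinese renderings of the examples named above, and the function names are hypothetical; a production word list would be far larger.

```python
# Assumed renderings of the example words; not an exhaustive list.
ROLE_WORDS = {"老师", "师傅", "先生", "小姐"}            # teacher, master, mister, miss
HIGH_FREQUENCY_WORDS = {"明天", "后天", "吃饭", "喝水"}  # tomorrow, day after tomorrow, eat, drink water


def is_noise(remark):
    """True if the remark contains any role word or high-frequency word."""
    return any(word in remark for word in ROLE_WORDS | HIGH_FREQUENCY_WORDS)


def candidate_remarks(first_class):
    """Drop noisy first-class remarks; what remains is the candidate remark information."""
    return [r for r in first_class if not is_noise(r)]
```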

The calculation module 20 is configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin, and to calculate the posterior probability corresponding to each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name;

Specifically, refer also to Fig. 7, which is a schematic structural diagram of a calculation module 20 provided by an embodiment of the present invention. The calculation module 20 may include: a pinyin acquisition unit 201, a frequency statistics unit 202, a first probability calculation unit 203, and a second probability calculation unit 204;

The pinyin acquisition unit 201 is configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

Specifically, the pinyin acquisition unit 201 may obtain the full pinyin corresponding to each piece of candidate user remark information among the at least one piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin. For example, if a certain piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo".

The frequency statistics unit 202 is configured to count, according to each piece of candidate user remark information, the occurrence frequency of each identical surname pinyin and the occurrence frequency of each identical given-name pinyin;

The first probability calculation unit 203 is configured to calculate the joint probability corresponding to each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total number of pieces of candidate user remark information;

The second probability calculation unit 204 is configured to calculate the posterior probability corresponding to each candidate real name according to the occurrence frequency of the identical full pinyin with the largest joint probability and the occurrence frequency of each candidate real name;

The formula for the posterior probability is: posterior probability P(candidate real name | optimal full pinyin) = occurrence frequency of the candidate real name under the optimal full pinyin / occurrence frequency of the optimal full pinyin, where the optimal full pinyin is the identical full pinyin with the largest joint probability. If the full pinyin of a candidate real name is not the optimal full pinyin, the occurrence frequency of that candidate real name under the optimal full pinyin is 0.

Further, refer also to Fig. 8, which is a schematic structural diagram of a first probability calculation unit 203 provided by an embodiment of the present invention. The first probability calculation unit 203 may include: a first probability calculation subunit 2031, a second probability calculation subunit 2032, and a joint probability calculation subunit 2033;

The first probability calculation subunit 2031 is configured to calculate the first probability corresponding to each identical surname pinyin according to the occurrence frequency of each identical surname pinyin and the total number of pieces of candidate user remark information;

The second probability calculation subunit 2032 is configured to calculate the second probability corresponding to each identical given-name pinyin according to the occurrence frequency of each identical given-name pinyin and the total number of pieces of candidate user remark information;

The joint probability calculation subunit 2033 is configured to perform a calculation on each first probability and each second probability to obtain the joint probability corresponding to each identical full pinyin;

The formula for the joint probability is: joint probability P_{full pinyin} = P_{surname pinyin} * P_{given-name pinyin}, where P_{surname pinyin} is the first probability and P_{given-name pinyin} is the second probability.
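Written out with explicit counts, the first probability, the second probability, and the joint probability take the following form. The symbols are notation introduced here for illustration: n_s is the occurrence frequency of surname pinyin s, n_g is the occurrence frequency of given-name pinyin g, and N is the total number of pieces of candidate user remark information.

```latex
P_{\text{surname}}(s) = \frac{n_s}{N}, \qquad
P_{\text{given}}(g) = \frac{n_g}{N}, \qquad
P_{\text{full}}(s, g) = P_{\text{surname}}(s) \cdot P_{\text{given}}(g) = \frac{n_s \, n_g}{N^2}
```

Because the joint probability is the product of two marginal frequencies, it treats the surname pinyin and the given-name pinyin as if they were independent.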

For example, suppose the at least one piece of candidate user remark information includes three candidate real names that are written with different characters but share the full pinyin "wu xiaobo", occurring 30, 20, and 10 times respectively (referred to below as the first, second, and third "Wu Xiaobo"), together with 10 occurrences of "Zhang Xiaobo" and 30 occurrences of "Zhang Haibo". The identical full pinyins are then "wu xiaobo", "zhang xiaobo", and "zhang haibo". The first probability calculation subunit 2031 may calculate, for the identical surname pinyin "wu", P_{surname pinyin} = occurrence frequency of "wu" / total number of pieces of candidate user remark information = 60/100, and for the identical surname pinyin "zhang", P_{surname pinyin} = occurrence frequency of "zhang" / total number of pieces of candidate user remark information = 40/100. The second probability calculation subunit 2032 may calculate, for the identical given-name pinyin "xiaobo", P_{given-name pinyin} = occurrence frequency of "xiaobo" / total number of pieces of candidate user remark information = 70/100, and for the identical given-name pinyin "haibo", P_{given-name pinyin} = occurrence frequency of "haibo" / total number of pieces of candidate user remark information = 30/100. The joint probability calculation subunit 2033 may then calculate the joint probability of the identical full pinyin "wu xiaobo" as P_{full pinyin} = P_{surname pinyin} of "wu" * P_{given-name pinyin} of "xiaobo" = 42/100, the joint probability of "zhang xiaobo" as P_{surname pinyin} of "zhang" * P_{given-name pinyin} of "xiaobo" = 28/100, and the joint probability of "zhang haibo" as P_{surname pinyin} of "zhang" * P_{given-name pinyin} of "haibo" = 12/100. The joint probability of the identical full pinyin "wu xiaobo" is the largest, so "wu xiaobo" is taken as the optimal full pinyin. The second probability calculation unit 204 may further calculate the posterior probability of the first "Wu Xiaobo" as P(first "Wu Xiaobo" | optimal full pinyin "wu xiaobo") = 30/60, of the second "Wu Xiaobo" as 20/60, of the third "Wu Xiaobo" as 10/60, of "Zhang Xiaobo" as P("Zhang Xiaobo" | optimal full pinyin "wu xiaobo") = 0, and of "Zhang Haibo" as P("Zhang Haibo" | optimal full pinyin "wu xiaobo") = 0.
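The arithmetic of this example can be reproduced with exact fractions, as in the sketch below. The labels "Wu Xiaobo (1)" to "Wu Xiaobo (3)" are placeholders for the three names that share the pinyin "wu xiaobo" but are written with different characters, and all variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# (candidate real name, surname pinyin, given-name pinyin, occurrence frequency)
remarks = [
    ("Wu Xiaobo (1)", "wu", "xiaobo", 30),
    ("Wu Xiaobo (2)", "wu", "xiaobo", 20),
    ("Wu Xiaobo (3)", "wu", "xiaobo", 10),
    ("Zhang Xiaobo", "zhang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 30),
]

total = sum(count for *_, count in remarks)
surname_counts, given_counts, full_counts = Counter(), Counter(), Counter()
for _, surname, given, count in remarks:
    surname_counts[surname] += count
    given_counts[given] += count
    full_counts[(surname, given)] += count

# Joint probability of each observed full pinyin: first probability * second probability.
joint = {
    (s, g): Fraction(surname_counts[s], total) * Fraction(given_counts[g], total)
    for (s, g) in full_counts
}
best_pinyin = max(joint, key=joint.get)

# Posterior: frequency of the name under the optimal full pinyin / frequency of that pinyin.
posterior = {
    name: Fraction(count if (s, g) == best_pinyin else 0, full_counts[best_pinyin])
    for name, s, g, count in remarks
}
best_name = max(posterior, key=posterior.get)
```

Here `best_pinyin` is `("wu", "xiaobo")` and `best_name` is the first "Wu Xiaobo", whose posterior probability is 30/60.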

The determination module 30 is configured to take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

Specifically, after the posterior probability corresponding to each candidate real name has been calculated, the determination module 30 may take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined; that is, the optimal real name may be determined to be the true real name of the user to be mined, so that the real name of the user is accurately identified. For example, suppose the candidate real names include the first, second, and third "Wu Xiaobo" (three names written with different characters that share the full pinyin "wu xiaobo"), "Zhang Xiaobo", and "Zhang Haibo", where the posterior probability of the first "Wu Xiaobo" is 30/60, that of the second "Wu Xiaobo" is 20/60, that of the third "Wu Xiaobo" is 10/60, that of "Zhang Xiaobo" is 0, and that of "Zhang Haibo" is 0. The determination module 30 may then determine the first "Wu Xiaobo", which has the largest posterior probability, as the optimal real name of the user to be mined.

Further, refer also to Fig. 9, which is a schematic structural diagram of a determining module 30 provided by an embodiment of the present invention. The determining module 30 may include: a first judging unit 301, a first determining unit 302, a correction determining unit 303, a second judging unit 304, a second determining unit 305, a weight calculation determining unit 306, a third judging unit 307, a third determining unit 308 and a model scoring determining unit 309;

The first judging unit 301 is configured to judge whether the largest posterior probability is greater than a preset probability threshold;

The first determining unit 302 is configured to, if the first judging unit 301 judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

The correction determining unit 303 is configured to, if the first judging unit 301 judges no, correct the posterior probability corresponding to each candidate real name according to a preset weight-adjustment rule, and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined;

Wherein the weight-adjustment rule includes at least one of the following mapping relations: the mapping relation between the occurrence frequency of a candidate real name and a correction parameter; the mapping relation between the weight of an identical full pinyin and a correction parameter; the mapping relation between the character complexity of a candidate real name and a correction parameter; the mapping relation between the character length of a candidate real name and a correction parameter; and the mapping relation between the popularity of a surname and a correction parameter. For the specific implementation of the first judging unit 301, the first determining unit 302 and the correction determining unit 303, refer to S201-S205 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
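The threshold check followed by correction can be sketched as below. This is an illustration under stated assumptions: the text does not say how a correction parameter is applied to a posterior probability, so the multiplication used here is an assumption, and the function name, candidate labels, threshold and correction values are all hypothetical.

```python
def pick_optimal_real_name(posteriors, threshold, correction):
    """Return the best candidate, correcting posteriors only below the threshold.

    `correction` maps a candidate to a correction parameter that a preset
    weight-adjustment rule would supply (assumed here to be a multiplier).
    """
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] > threshold:
        # Judged yes: the largest posterior is trusted as it is.
        return best
    # Judged no: correct every posterior, then take the largest corrected value.
    corrected = {name: p * correction.get(name, 1.0)
                 for name, p in posteriors.items()}
    return max(corrected, key=corrected.get)


posteriors = {"candidate_a": 0.45, "candidate_b": 0.40, "candidate_c": 0.15}
# Hypothetical parameters, e.g. penalising an unusually long or complex name.
correction = {"candidate_a": 0.8, "candidate_b": 1.0, "candidate_c": 1.0}

below_threshold_choice = pick_optimal_real_name(posteriors, 0.5, correction)
above_threshold_choice = pick_optimal_real_name(posteriors, 0.4, correction)
```

With a threshold of 0.5 no posterior is large enough, so the corrected values decide; with a threshold of 0.4 the uncorrected maximum is accepted directly.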

The second judging unit 304 is configured to judge whether the largest posterior probability is greater than a preset probability threshold;

The second determining unit 305 is configured to, if the second judging unit 304 judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

The weight calculation determining unit 306 is configured to, if the second judging unit 304 judges no, calculate the ranking weight value corresponding to each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined;

Wherein the user real-name remark habit value refers to the ratio of the number of pieces of user remark information in which a user remarks a friend with the friend's real name to the number of all pieces of user remark information with which that user remarks friends. For the specific implementation of the second judging unit 304, the second determining unit 305 and the weight calculation determining unit 306, refer to S301-S305 in the embodiment corresponding to Fig. 3 above; details are not repeated here.
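The habit value is a simple ratio, which the following sketch computes and then folds into a ranking weight value. The combining formula is not given in this passage, so the product of the posterior probability and the mean habit value of the supporting remarks is purely an assumption for illustration; the function names and all numbers are hypothetical.

```python
def habit_value(real_name_remarks, all_remarks):
    """Share of a user's friend remarks that are real names."""
    if all_remarks == 0:
        return 0.0
    return real_name_remarks / all_remarks


def ranking_weights(posteriors, supporting_habits):
    """Assumed combination: posterior times mean habit value of supporters."""
    weights = {}
    for name, posterior in posteriors.items():
        habits = supporting_habits.get(name, [])
        mean_habit = sum(habits) / len(habits) if habits else 0.0
        weights[name] = posterior * mean_habit
    return weights


posteriors = {"candidate_a": 0.45, "candidate_b": 0.40}
# Habit values of the users whose remarks support each candidate (hypothetical).
supporting_habits = {
    "candidate_a": [habit_value(2, 10), habit_value(4, 10)],
    "candidate_b": [habit_value(9, 10), habit_value(7, 10)],
}

weights = ranking_weights(posteriors, supporting_habits)
optimal_real_name = max(weights, key=weights.get)
```

Under this assumption a candidate backed by users who habitually remark friends by real name can overtake a candidate with a slightly larger posterior probability.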

The third judging unit 307 is configured to judge whether the largest posterior probability is greater than a preset probability threshold;

The third determining unit 308 is configured to, if the third judging unit 307 judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

The model scoring determining unit 309 is configured to, if the third judging unit 307 judges no, select the candidate user remark information corresponding to the candidate real names with the largest and the second-largest posterior probabilities respectively, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and the second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined;

Wherein, for the specific implementation of the third judging unit 307, the third determining unit 308 and the model scoring determining unit 309, refer to S401-S405 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
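A rough sketch of the scoring step is given below. The features and the model form are not specified in this passage, so the two features (number of supporting remarks, mean habit value of the remarking users) and the linear scoring function stand in for the preset rank model as assumptions; every name and number is hypothetical.

```python
def top_two(posteriors):
    """Candidates with the largest and second-largest posterior probability."""
    return sorted(posteriors, key=posteriors.get, reverse=True)[:2]


def extract_features(habit_values):
    """Hypothetical features: supporting-remark count and mean habit value."""
    count = len(habit_values)
    mean_habit = sum(habit_values) / count if count else 0.0
    return [float(count), mean_habit]


def rank_score(features, model_weights):
    """Stand-in for the preset rank model: a plain weighted sum."""
    return sum(w * x for w, x in zip(model_weights, features))


def pick_by_rank_model(posteriors, remark_habits, model_weights):
    finalists = top_two(posteriors)
    scores = {
        name: rank_score(extract_features(remark_habits.get(name, [])), model_weights)
        for name in finalists
    }
    return max(scores, key=scores.get)


posteriors = {"candidate_a": 0.42, "candidate_b": 0.40, "candidate_c": 0.18}
remark_habits = {"candidate_a": [0.2] * 21, "candidate_b": [0.9] * 20}
model_weights = [0.01, 1.0]  # hypothetical trained parameters

finalists = top_two(posteriors)
optimal_real_name = pick_by_rank_model(posteriors, remark_habits, model_weights)
```

Only the two finalists are scored, which matches the selection of the largest and second-largest posterior probabilities described above.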

Optionally, while the first judging unit 301, the first determining unit 302 and the correction determining unit 303 perform their operations, the second judging unit 304, the second determining unit 305, the weight calculation determining unit 306, the third judging unit 307, the third determining unit 308 and the model scoring determining unit 309 all stop working. While the second judging unit 304, the second determining unit 305 and the weight calculation determining unit 306 perform their operations, the first judging unit 301, the first determining unit 302, the correction determining unit 303, the third judging unit 307, the third determining unit 308 and the model scoring determining unit 309 all stop working. While the third judging unit 307, the third determining unit 308 and the model scoring determining unit 309 perform their operations, the first judging unit 301, the first determining unit 302, the correction determining unit 303, the second judging unit 304, the second determining unit 305 and the weight calculation determining unit 306 all stop working. The first judging unit 301, the second judging unit 304 and the third judging unit 307 may be one and the same judging unit; the first determining unit 302, the second determining unit 305 and the third determining unit 308 may be one and the same determining unit.

Refer again to Fig. 10, which is a schematic structural diagram of another data mining processing device 1 provided by an embodiment of the present invention. The data mining processing device 1 may be applied in a server based on a social network, and may include the acquisition and mining module 10, the calculation module 20 and the determining module 30 of the embodiment corresponding to Fig. 5 above. Further, the data mining processing device 1 may also include: an acquisition determining module 40, a set determining module 50 and a model training module 60;

The acquisition determining module 40 is configured to obtain multiple pieces of training user remark information that correspond to a user whose real name is known and are used for training the rank model, and to take identical pieces of training user remark information among the multiple pieces of training user remark information as respective training candidate real names;

The set determining module 50 is configured to take each piece of training user remark information corresponding to a training candidate real name that is the user's real name as a first support set; the first support set corresponds to a first score value;

The set determining module 50 is further configured to take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

The model training module 60 is configured to extract the features of the first support set and the features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;

Wherein, after the acquisition determining module 40, the set determining module 50 and the model training module 60 have established and trained the rank model, the model scoring determining unit 309 of the embodiment corresponding to Fig. 9 above can calculate the final scores of the two candidate real names input into the rank model, according to the score value of the support set (the first support set or the second support set) to which the pieces of candidate user remark information corresponding to each of the two candidate real names belong.
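How the two support sets and their score labels might be assembled from training data is sketched below; the model fitting itself is left out. The function names, the score values 2.0 and 1.0, the pinyin lookup table and the example Chinese names are illustrative assumptions, not data from the source.

```python
def build_support_sets(real_name, training_remarks, pinyin_of):
    """Split training remarks into the first and second support sets."""
    real_pinyin = pinyin_of[real_name]
    # First support set: remarks that are exactly the known real name.
    first = [r for r in training_remarks if r == real_name]
    # Second support set: different names sharing the real name's full pinyin.
    second = [r for r in training_remarks
              if r != real_name and pinyin_of.get(r) == real_pinyin]
    return first, second


def labelled_examples(first, second, first_score=2.0, second_score=1.0):
    """Attach score values; the first must be greater than the second."""
    if first_score <= second_score:
        raise ValueError("first score value must exceed second score value")
    return ([(r, first_score) for r in first]
            + [(r, second_score) for r in second])


# Hypothetical lookup table standing in for a pinyin conversion step.
pinyin_of = {
    "吴晓波": "wu xiaobo",
    "吴小波": "wu xiaobo",
    "吴晓博": "wu xiaobo",
    "张海波": "zhang haibo",
}
training_remarks = ["吴晓波", "吴晓波", "吴小波", "吴晓博", "张海波"]

first_set, second_set = build_support_sets("吴晓波", training_remarks, pinyin_of)
examples = labelled_examples(first_set, second_set)
```

Remarks that neither match the real name nor share its full pinyin (here "张海波") fall into neither set, so they contribute no labelled example.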

Refer again to Fig. 11, which is a schematic structural diagram of a server provided by an embodiment of the present invention. As shown in Fig. 11, the server 1000 may include: at least one processor 1001 (for example a CPU), at least one network interface 1004, a user interface 1003, a memory 1005 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and a keyboard, and optionally may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface). The memory 1005 may be a high-speed RAM, or a non-volatile memory, for example at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in Fig. 11, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a device control application program.

In the server 1000 shown in Fig. 11, the network interface 1004 is mainly used to connect to a client and to receive the user remark information sent by the client; the user interface 1003 is mainly used to provide an input interface for the user and to obtain data output by the user; and the processor 1001 may be used to call the device control application program stored in the memory 1005, so as to implement

In one embodiment, when obtaining the multiple pieces of user remark information corresponding to the user to be mined, mining at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names, the processor 1001 specifically performs:

Obtaining the multiple pieces of user remark information corresponding to the user to be mined, and filtering out, from the multiple pieces of user remark information, first-class user remark information that satisfies a surname condition, according to a name structure rule and a preset surname matching table;

Deleting, from the first-class user remark information, the user remark information that contains proper nouns and/or high-frequency words, determining the first-class user remark information remaining after the deletion to be the at least one piece of candidate user remark information, and taking identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names.

In one embodiment, when counting the occurrence frequency corresponding to each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculating the posterior probability corresponding to each candidate real name according to the occurrence frequency corresponding to each identical pinyin and the occurrence frequency corresponding to each candidate real name, the processor 1001 specifically performs:

Obtaining the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

Counting, according to each piece of candidate user remark information, the occurrence frequency corresponding to each identical surname pinyin and the occurrence frequency corresponding to each identical given-name pinyin;

Calculating the joint probability corresponding to each identical full pinyin according to the occurrence frequency corresponding to each identical surname pinyin, the occurrence frequency corresponding to each identical given-name pinyin, and the total amount of candidate user remark information;

Calculating the posterior probability corresponding to each candidate real name according to the occurrence frequency corresponding to the identical full pinyin with the largest joint probability and the occurrence frequency corresponding to each candidate real name.

In one embodiment, when calculating the joint probability corresponding to each identical full pinyin according to the occurrence frequency corresponding to each identical surname pinyin, the occurrence frequency corresponding to each identical given-name pinyin, and the total amount of candidate user remark information, the processor 1001 specifically performs:

Calculating the first probability corresponding to each identical surname pinyin according to the occurrence frequency corresponding to each identical surname pinyin and the total amount of candidate user remark information;

Calculating the second probability corresponding to each identical given-name pinyin according to the occurrence frequency corresponding to each identical given-name pinyin and the total amount of candidate user remark information;

Performing a calculation on each first probability and each second probability to obtain the joint probability corresponding to each identical full pinyin.
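The three steps above can be sketched as follows. The passage does not name the operation that combines the first and second probabilities, so the product used here is an assumption; the function name and the counts in the sample data are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def joint_probabilities(full_pinyins):
    """Joint probability of each observed (surname, given-name) pinyin pair."""
    total = len(full_pinyins)
    surname_counts = Counter(surname for surname, _ in full_pinyins)
    given_counts = Counter(given for _, given in full_pinyins)
    joint = {}
    for surname, given in set(full_pinyins):
        first = Fraction(surname_counts[surname], total)   # first probability
        second = Fraction(given_counts[given], total)      # second probability
        joint[(surname, given)] = first * second           # assumed: product
    return joint


# One (surname pinyin, given-name pinyin) pair per candidate remark.
full_pinyins = ([("wu", "xiaobo")] * 60
                + [("zhang", "xiaobo")] * 10
                + [("zhang", "haibo")] * 30)

joint = joint_probabilities(full_pinyins)
optimal_full_pinyin = max(joint, key=joint.get)
```

With these counts the surname pinyin "zhang" has probability 40/100 and the given-name pinyin "haibo" 30/100, so their product is 12/100, while "wu xiaobo" obtains the largest joint probability.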

In one embodiment, when taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined, the processor 1001 specifically performs:

Judging whether the largest posterior probability is greater than a preset probability threshold;

If the judgment is yes, taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

If the judgment is no, correcting the posterior probability corresponding to each candidate real name according to a preset weight-adjustment rule, and taking the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined;

Wherein the weight-adjustment rule includes at least one of the following mapping relations: the mapping relation between the occurrence frequency of a candidate real name and a correction parameter; the mapping relation between the weight of an identical full pinyin and a correction parameter; the mapping relation between the character complexity of a candidate real name and a correction parameter; the mapping relation between the character length of a candidate real name and a correction parameter; and the mapping relation between the popularity of a surname and a correction parameter.

Alternatively, if the judgment is no, calculating the ranking weight value corresponding to each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and taking the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined;

Wherein the user real-name remark habit value refers to the ratio of the number of pieces of user remark information in which a user remarks a friend with the friend's real name to the number of all pieces of user remark information with which that user remarks friends.

Alternatively, if the judgment is no, selecting the candidate user remark information corresponding to the candidate real names with the largest and the second-largest posterior probabilities respectively, performing feature extraction on the selected candidate user remark information, scoring the candidate real names with the largest and the second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and taking the higher-scoring candidate real name as the optimal real name of the user to be mined.

In one embodiment, the processor 1001 also performs:

Obtaining multiple pieces of training user remark information that correspond to a user whose real name is known and are used for training the rank model, and taking identical pieces of training user remark information among the multiple pieces of training user remark information as respective training candidate real names;

Taking each piece of training user remark information corresponding to a training candidate real name that is the user's real name as a first support set; the first support set corresponds to a first score value;

Taking each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

Extracting the features of the first support set and the features of the second support set, and establishing and training the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;

Wherein the first support set and the second support set in the trained rank model are used to score the candidate real names that are input.

A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

What is disclosed above is only the preferred embodiments of the present invention, which certainly cannot be used to limit the scope of rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims

1. A data mining processing method, characterized by comprising:

2. The method of claim 1, characterized in that the obtaining multiple pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names comprises:

3. The method of claim 1, characterized in that the counting the occurrence frequency corresponding to each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculating the posterior probability corresponding to each candidate real name according to the occurrence frequency corresponding to each identical pinyin and the occurrence frequency corresponding to each candidate real name comprises:

4. The method of claim 3, characterized in that the calculating the joint probability corresponding to each identical full pinyin according to the occurrence frequency corresponding to each identical surname pinyin, the occurrence frequency corresponding to each identical given-name pinyin, and the total amount of candidate user remark information comprises:

5. The method of claim 1, characterized in that the taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined comprises:

6. The method of claim 1, characterized in that the taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined comprises:

7. The method of claim 1, characterized in that the taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined comprises:

8. The method of claim 7, characterized by further comprising:

9. A data mining processing device, characterized by comprising:

10. The device of claim 9, characterized in that the acquisition and mining module comprises:

An acquisition screening unit, configured to obtain the multiple pieces of user remark information corresponding to the user to be mined, and to filter out, from the multiple pieces of user remark information, first-class user remark information that satisfies a surname condition, according to a name structure rule and a preset surname matching table;

A deletion determining unit, configured to delete, from the first-class user remark information, the user remark information that contains proper nouns and/or high-frequency words, to determine the first-class user remark information remaining after the deletion to be the at least one piece of candidate user remark information, and to take identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names.

11. The device of claim 9, characterized in that the calculation module comprises:

A pinyin acquiring unit, configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;

A frequency counting unit, configured to count, according to each piece of candidate user remark information, the occurrence frequency corresponding to each identical surname pinyin and the occurrence frequency corresponding to each identical given-name pinyin;

A first probability calculation unit, configured to calculate the joint probability corresponding to each identical full pinyin according to the occurrence frequency corresponding to each identical surname pinyin, the occurrence frequency corresponding to each identical given-name pinyin, and the total amount of candidate user remark information;

A second probability calculation unit, configured to calculate the posterior probability corresponding to each candidate real name according to the occurrence frequency corresponding to the identical full pinyin with the largest joint probability and the occurrence frequency corresponding to each candidate real name.

12. The device of claim 11, characterized in that the first probability calculation unit comprises:

A first probability calculation subunit, configured to calculate the first probability corresponding to each identical surname pinyin according to the occurrence frequency corresponding to each identical surname pinyin and the total amount of candidate user remark information;

A second probability calculation subunit, configured to calculate the second probability corresponding to each identical given-name pinyin according to the occurrence frequency corresponding to each identical given-name pinyin and the total amount of candidate user remark information;

A joint probability calculation subunit, configured to perform a calculation on each first probability and each second probability to obtain the joint probability corresponding to each identical full pinyin.

13. The device of claim 9, characterized in that the determining module comprises:

A first judging unit, configured to judge whether the largest posterior probability is greater than a preset probability threshold;

A first determining unit, configured to, if the first judging unit judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

A correction determining unit, configured to, if the first judging unit judges no, correct the posterior probability corresponding to each candidate real name according to a preset weight-adjustment rule, and take the candidate real name with the largest corrected posterior probability as the optimal real name of the user to be mined;

14. The device of claim 9, characterized in that the determining module comprises:

A second judging unit, configured to judge whether the largest posterior probability is greater than a preset probability threshold;

A second determining unit, configured to, if the second judging unit judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

A weight calculation determining unit, configured to, if the second judging unit judges no, calculate the ranking weight value corresponding to each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined;

15. The device of claim 9, characterized in that the determining module comprises:

A third judging unit, configured to judge whether the largest posterior probability is greater than a preset probability threshold;

A third determining unit, configured to, if the third judging unit judges yes, take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;

A model scoring determining unit, configured to, if the third judging unit judges no, select the candidate user remark information corresponding to the candidate real names with the largest and the second-largest posterior probabilities respectively, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and the second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined.

16. The device of claim 15, characterized by further comprising:

An acquisition determining module, configured to obtain multiple pieces of training user remark information that correspond to a user whose real name is known and are used for training the rank model, and to take identical pieces of training user remark information among the multiple pieces of training user remark information as respective training candidate real names;

A set determining module, configured to take each piece of training user remark information corresponding to a training candidate real name that is the user's real name as a first support set; the first support set corresponds to a first score value;

The set determining module being further configured to take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;

A model training module, configured to extract the features of the first support set and the features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;