CN106021235B - A kind of data mining processing method and device - Google Patents
A data mining processing method and device
- Publication number
- CN106021235B (application CN201610387322.5A)
- Authority
- CN
- China
- Prior art keywords
- user
- candidate
- real name
- remark information
- phonetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/226—Validation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
An embodiment of the invention discloses a data mining processing method and device. The method includes: obtaining multiple items of user remark information corresponding to a user to be mined, mining at least one item of candidate user remark information from them, and taking each group of identical candidate user remark information as a candidate real name; counting, according to the pinyin (phonetic spelling) of each item of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names; and taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined. The invention can accurately analyze and identify a user's real name and thereby enrich the functions of a social network.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a data mining processing method and device.
Background
With the development of Internet technology, more and more users take part in social networks. Before joining a social network a user must register, and the registered user name can be any characters the user chooses, so the registration information may not contain the user's real name. Some functions nevertheless depend on it: security monitoring in a social network needs the real name to tell whether a user is fraudulent, and accurate audience mining needs the real name as well. In current social networks, however, the real name can only be supplied voluntarily by the user. When a user is unwilling to provide it, the server side of the social network cannot learn it, and some functions of the social network cannot be fully realized.
Summary of the invention
Embodiments of the present invention provide a data mining processing method and device that can accurately analyze and identify a user's real name, thereby enriching the functions of a social network.
An embodiment of the present invention provides a data mining processing method, comprising:
obtaining multiple items of user remark information corresponding to a user to be mined, mining at least one item of candidate user remark information from them, and taking each group of identical candidate user remark information as a candidate real name;
counting, according to the pinyin of each item of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.
Correspondingly, an embodiment of the present invention further provides a data mining processing device, comprising:
an obtaining and mining module, configured to obtain multiple items of user remark information corresponding to a user to be mined, mine at least one item of candidate user remark information from them, and take each group of identical candidate user remark information as a candidate real name;
a computing module, configured to count, according to the pinyin of each item of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability of each candidate real name from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
a determining module, configured to take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.
In the embodiments of the present invention, at least one item of candidate user remark information is mined from multiple items of user remark information, and each group of identical candidate user remark information is taken as a candidate real name; the occurrence count of each identical pinyin is counted according to the pinyin of each item of candidate user remark information; the posterior probability of each candidate real name is calculated from the occurrence counts of the identical pinyin and of the candidate real names; and the candidate real name with the largest posterior probability is taken as the optimal real name of the user to be mined. The user's real name can thus be determined accurately from user remark information even when the user has not provided it, and the functions of the social network can be enriched on that basis.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data mining processing method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another data mining processing method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of yet another data mining processing method provided by an embodiment of the present invention;
Fig. 4 is a flowchart of yet another data mining processing method provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a data mining processing device provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of an obtaining and mining module provided by an embodiment of the present invention;
Fig. 7 is a structural diagram of a computing module provided by an embodiment of the present invention;
Fig. 8 is a structural diagram of a first probability calculation unit provided by an embodiment of the present invention;
Fig. 9 is a structural diagram of a determining module provided by an embodiment of the present invention;
Fig. 10 is a structural diagram of another data mining processing device provided by an embodiment of the present invention;
Fig. 11 is a structural diagram of a server provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which is a flowchart of a data mining processing method provided by an embodiment of the present invention, the method may include:
S101: obtain multiple items of user remark information corresponding to a user to be mined, mine at least one item of candidate user remark information from them, and take each group of identical candidate user remark information as a candidate real name;
Specifically, a server of the social network can obtain multiple items of user remark information corresponding to the user to be mined. The user to be mined is the user whose real name the server needs to identify, and the user remark information is the remark text that the user's friends have attached to that user. For example, if the user to be mined has 100 friends and 75 of them have set a remark for that user, the remarks set by those 75 friends serve as the multiple items of user remark information. The server then mines at least one item of candidate user remark information from them and takes each group of identical candidate user remark information as a candidate real name. For example, if 20 items of candidate user remark information are "Wang AB", 3 are "Huang AC", 15 are "Huang AB" and 30 are "Wang AC", then "Wang AB", "Huang AC", "Huang AB" and "Wang AC" are taken as the candidate real names.
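The grouping of identical candidate remarks into candidate real names can be sketched in a few lines of Python. This is only an illustration under the assumption that the candidate remarks are already available as plain strings; the variable names are hypothetical and the counts are the ones from the example above.

```python
from collections import Counter

# Candidate remarks as they might arrive from the filtering step
# (counts taken from the example in the text: 20, 3, 15 and 30).
candidate_remarks = (
    ["Wang AB"] * 20 + ["Huang AC"] * 3 + ["Huang AB"] * 15 + ["Wang AC"] * 30
)

# Each distinct remark string becomes one candidate real name,
# stored together with its occurrence count.
candidate_real_names = Counter(candidate_remarks)

for name, count in candidate_real_names.items():
    print(name, count)
```

Keeping the occurrence count next to each candidate real name is convenient because the later probability calculation uses exactly these counts.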
The server may mine the at least one item of candidate user remark information as follows. It obtains the multiple items of user remark information corresponding to the user to be mined and, according to a name structure rule and a preset surname matching table, selects from them the first-type user remark information that satisfies the surname condition. It then deletes every item of first-type user remark information that contains a proper noun and/or a high-frequency word, and takes the first-type user remark information remaining after the deletion as the at least one item of candidate user remark information. The proper nouns here are role words such as "teacher", "master", "Mr." and "Miss"; the high-frequency words are words that occur very often, such as "tomorrow", "the day after tomorrow", "have a meal" and "drink water". For example, the first-type user remark information "Teacher Wang" contains a proper noun and is therefore deleted.
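The deletion step can be illustrated with a small filter. The sketch below is not the patented implementation: the two word lists are hypothetical English stand-ins for the role words and high-frequency words named above, and simple case-insensitive substring matching is an assumption made for brevity.

```python
# Hypothetical word lists; a real system would maintain far larger ones.
ROLE_WORDS = {"teacher", "master", "mr.", "miss"}
HIGH_FREQUENCY_WORDS = {"tomorrow", "the day after tomorrow", "have a meal", "drink water"}


def drop_noise_remarks(first_type_remarks):
    """Keep only remarks that contain no role word and no high-frequency word."""
    blocked = ROLE_WORDS | HIGH_FREQUENCY_WORDS
    kept = []
    for remark in first_type_remarks:
        lowered = remark.lower()
        # Substring test: any blocked word anywhere in the remark removes it.
        if not any(word in lowered for word in blocked):
            kept.append(remark)
    return kept


print(drop_noise_remarks(["Teacher Wang", "Wang AB", "have a meal tomorrow"]))
```

With the sample input, only "Wang AB" survives, which mirrors the "Teacher Wang" example in the text.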
The name structure rule may refer to the number of characters in an ordinary name: an ordinary name generally has 2 to 4 Chinese characters (2 to 3 for a name with a single-character surname, 3 to 4 for a name with a compound surname). Accordingly, the first-type user remark information that satisfies the surname condition may be selected from the multiple items of user remark information as follows. The server first segments each obtained item of user remark information with an effective word segmentation algorithm (for example, the user remark information "he is Wang AB" becomes "Wang AB" after segmentation), and keeps the segmented user remark information that contains 2 to 4 Chinese characters as preliminarily screened user remark information. It then matches the preliminarily screened user remark information against the preset surname matching table according to its length. For an item of 2 characters, the server checks whether the first character is in the single-character surname set; if so, the item satisfies the surname condition and is taken as first-type user remark information, otherwise it is rejected. For an item of 4 characters, the server checks whether the first two characters are in the compound surname set; if so, the item satisfies the surname condition and is taken as first-type user remark information, otherwise it is rejected. For an item of 3 characters, the server checks whether the first character is in the single-character surname set or the first two characters are in the compound surname set; if either condition holds, the item satisfies the surname condition and is taken as first-type user remark information, and it is rejected only if neither holds.
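The length-dependent surname check can be sketched as follows. The surname sets are tiny hypothetical excerpts, the sample names are made up for illustration, and the regular expression restricting remarks to 2 to 4 characters from the basic CJK block is an assumption about how "Chinese character" might be tested; word segmentation is taken as already done.

```python
import re

SINGLE_SURNAMES = {"王", "黄", "张", "吴"}   # hypothetical excerpt of the single-character set
COMPOUND_SURNAMES = {"欧阳"}                 # hypothetical excerpt of the compound set
NAME_PATTERN = re.compile(r"[\u4e00-\u9fff]{2,4}")  # 2 to 4 Chinese characters


def meets_surname_condition(remark):
    """Apply the name structure rule and the surname matching table."""
    if not NAME_PATTERN.fullmatch(remark):
        return False
    single_hit = remark[0] in SINGLE_SURNAMES
    compound_hit = remark[:2] in COMPOUND_SURNAMES
    if len(remark) == 2:
        return single_hit          # 2 characters: single-character surname only
    if len(remark) == 4:
        return compound_hit        # 4 characters: compound surname only
    return single_hit or compound_hit  # 3 characters: either is enough


for sample in ["王波", "王小波", "欧阳波", "欧阳小波", "王小波波", "小波"]:
    print(sample, meets_surname_condition(sample))
```

Checking the compound set only on the first two characters keeps the 3-character case symmetric with the description: either surname form is accepted.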
S102: count, according to the pinyin of each item of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
Specifically, the server can obtain the full pinyin of each item of candidate user remark information; the full pinyin consists of a surname pinyin and a given-name pinyin. For example, for the candidate user remark information "Zhang Xiaobo" the full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo". The server then counts the occurrences of each identical surname pinyin and of each identical given-name pinyin across the candidate user remark information. For example, suppose the candidate user remark information comprises 20 "Zhang Xiaobo" in one written form, 25 "Zhang Xiaobo" in a second written form with the same pinyin, 10 "Wang Xiaobo" and 5 "Zhang Haibo". The identical surname pinyin are then "zhang" and "wang", and the identical given-name pinyin are "xiaobo" and "haibo"; the surname pinyin "zhang" occurs 50 times, the surname pinyin "wang" 10 times, the given-name pinyin "xiaobo" 55 times and the given-name pinyin "haibo" 5 times. The server then calculates the joint probability of each identical full pinyin from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin and the total number of items of candidate user remark information, and calculates the posterior probability of each candidate real name from the occurrence count of the identical full pinyin with the largest joint probability and the occurrence count of each candidate real name.
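The pinyin counting in this step can be sketched as below. The sketch assumes that each candidate remark has already been converted to a surname pinyin and a given-name pinyin (the conversion itself is outside the example), and the "#1" and "#2" suffixes are placeholder labels for the two homophonous written forms.

```python
from collections import Counter

# (written form, surname pinyin, given-name pinyin, occurrence count)
remarks = [
    ("Zhang Xiaobo #1", "zhang", "xiaobo", 20),
    ("Zhang Xiaobo #2", "zhang", "xiaobo", 25),
    ("Wang Xiaobo", "wang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 5),
]

surname_counts = Counter()
given_counts = Counter()
for _name, surname, given, count in remarks:
    # Written forms that share a pinyin are pooled into the same bucket.
    surname_counts[surname] += count
    given_counts[given] += count

print(dict(surname_counts))
print(dict(given_counts))
```

The totals reproduce the figures in the text: 50 and 10 for the surname pinyin, 55 and 5 for the given-name pinyin.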
The joint probability of each identical full pinyin is calculated from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin and the total number of items of candidate user remark information as follows: from the occurrence count of each identical surname pinyin and the total number of items of candidate user remark information, calculate a first probability for each identical surname pinyin; from the occurrence count of each identical given-name pinyin and the total number of items of candidate user remark information, calculate a second probability for each identical given-name pinyin; and combine the first probabilities with the second probabilities to obtain the joint probability of each identical full pinyin.
The joint probability is calculated as $P_{\text{full pinyin}} = P_{\text{surname pinyin}} \times P_{\text{given-name pinyin}}$, where $P_{\text{surname pinyin}}$ is the first probability and $P_{\text{given-name pinyin}}$ is the second probability. The posterior probability is calculated as $P(\text{candidate real name} \mid \text{best full pinyin}) = N_{\text{candidate real name in best full pinyin}} / N_{\text{best full pinyin}}$, that is, the occurrence count of the candidate real name within the best full pinyin divided by the occurrence count of the best full pinyin. The best full pinyin is the identical full pinyin with the largest joint probability; if the full pinyin of a candidate real name is not the best full pinyin, the occurrence count of that candidate real name within the best full pinyin is 0.
For example, suppose the candidate user remark information comprises 30, 20 and 10 remarks in three different written forms that all have the pinyin "wu xiaobo" (referred to here as the first, second and third written forms of "Wu Xiaobo"), together with 10 "Zhang Xiaobo" and 30 "Zhang Haibo", 100 items in total. The identical full pinyin are "wu xiaobo", "zhang xiaobo" and "zhang haibo". For the surname pinyin, $P_{\text{surname pinyin}}$ of "wu" = occurrence count of "wu" / total = 60/100, and $P_{\text{surname pinyin}}$ of "zhang" = occurrence count of "zhang" / total = 40/100. For the given-name pinyin, $P_{\text{given-name pinyin}}$ of "xiaobo" = occurrence count of "xiaobo" / total = 70/100, and $P_{\text{given-name pinyin}}$ of "haibo" = occurrence count of "haibo" / total = 30/100. The joint probabilities are therefore: $P_{\text{full pinyin}}$ of "wu xiaobo" = 60/100 × 70/100 = 42/100; $P_{\text{full pinyin}}$ of "zhang xiaobo" = 40/100 × 70/100 = 28/100; $P_{\text{full pinyin}}$ of "zhang haibo" = 40/100 × 30/100 = 12/100. The joint probability of "wu xiaobo" is the largest, so "wu xiaobo" is taken as the best full pinyin. The posterior probabilities are then: P(first written form of "Wu Xiaobo" | best full pinyin "wu xiaobo") = 30/60; P(second written form of "Wu Xiaobo" | best full pinyin "wu xiaobo") = 20/60; P(third written form of "Wu Xiaobo" | best full pinyin "wu xiaobo") = 10/60; P("Zhang Xiaobo" | best full pinyin "wu xiaobo") = 0; and P("Zhang Haibo" | best full pinyin "wu xiaobo") = 0.
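The worked example can be reproduced with a short Python sketch. It is a minimal illustration, not the patented implementation: the tuple layout and the "#1" to "#3" labels for the three homophonous written forms are assumptions, and exact fractions are used so that the results can be compared directly with the figures in the text.

```python
from collections import Counter
from fractions import Fraction

# (written form, surname pinyin, given-name pinyin, occurrence count)
remarks = [
    ("Wu Xiaobo #1", "wu", "xiaobo", 30),
    ("Wu Xiaobo #2", "wu", "xiaobo", 20),
    ("Wu Xiaobo #3", "wu", "xiaobo", 10),
    ("Zhang Xiaobo", "zhang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 30),
]

total = sum(count for _name, _s, _g, count in remarks)
surname_counts, given_counts, full_counts = Counter(), Counter(), Counter()
for _name, surname, given, count in remarks:
    surname_counts[surname] += count
    given_counts[given] += count
    full_counts[(surname, given)] += count

# Joint probability of each full pinyin that actually occurs:
# P(full) = P(surname) * P(given name).
joint = {
    (s, g): Fraction(surname_counts[s], total) * Fraction(given_counts[g], total)
    for (s, g) in full_counts
}
best_full = max(joint, key=joint.get)

# Posterior of each candidate real name given the best full pinyin;
# candidates with a different full pinyin get a count of 0.
posterior = {
    name: Fraction(count if (s, g) == best_full else 0, full_counts[best_full])
    for name, s, g, count in remarks
}
best_name = max(posterior, key=posterior.get)

print(best_full, joint[best_full])
print(best_name, posterior[best_name])
```

Only full pinyin that actually occur in the remarks are scored, which matches the example: the combination "wu haibo" never appears and is not considered.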
S103: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;
Specifically, after calculating the posterior probability of each candidate real name, the server can take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined, that is, determine the optimal real name to be the true real name of that user, so that the user's real name is identified accurately. For example, suppose the candidate real names are three homophonous written forms of "Wu Xiaobo" (first, second and third), "Zhang Xiaobo" and "Zhang Haibo", with posterior probabilities of 30/60, 20/60, 10/60, 0 and 0 respectively. The first written form of "Wu Xiaobo", which has the largest posterior probability, is then determined to be the optimal real name of the user to be mined.
In this embodiment of the present invention, at least one item of candidate user remark information is mined from multiple items of user remark information, and each group of identical candidate user remark information is taken as a candidate real name; the occurrence count of each identical pinyin is counted according to the pinyin of each item of candidate user remark information; the posterior probability of each candidate real name is calculated from the occurrence counts of the identical pinyin and of the candidate real names; and the candidate real name with the largest posterior probability is taken as the optimal real name of the user to be mined. The user's real name can thus be determined accurately from user remark information even when the user has not provided it, and the functions of the social network can be enriched on that basis.
Referring to Fig. 2, which is a flowchart of another data mining processing method provided by an embodiment of the present invention, the method may include:
S201: obtain multiple items of user remark information corresponding to a user to be mined, mine at least one item of candidate user remark information from them, and take each group of identical candidate user remark information as a candidate real name;
S202: count, according to the pinyin of each item of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
For the specific implementation of steps S201 to S202, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; the details are not repeated here.
S203: determine whether the largest posterior probability is greater than a preset probability threshold;
Specifically, after calculating the posterior probability of each candidate real name, the server can further determine whether the largest posterior probability is greater than the preset probability threshold.
S204: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;
Specifically, if the determination in S203 is yes, the largest posterior probability is sufficiently credible, so the candidate real name with the largest posterior probability can be taken as the optimal real name of the user to be mined, which ensures that the optimal real name is the true real name of that user.
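The threshold decision of S203 and S204 can be sketched as a small helper. The function name and the threshold values in the usage lines are hypothetical; the text only states that the largest posterior probability is compared with a preset probability threshold using "greater than".

```python
def pick_if_confident(posteriors, threshold):
    """Return the top candidate if its posterior exceeds the threshold, else None."""
    best_name = max(posteriors, key=posteriors.get)
    if posteriors[best_name] > threshold:
        return best_name
    # Not confident enough: the caller would move on to the correction step.
    return None


print(pick_if_confident({"A": 0.5, "B": 0.3}, threshold=0.4))
print(pick_if_confident({"A": 0.5, "B": 0.3}, threshold=0.5))
```

Returning `None` rather than a low-confidence name makes the fallback explicit for the caller.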
S205 is modified the corresponding posterior probability of each candidate real name according to preset tune power rule, and
Using the candidate real name of maximum revised posterior probability as the optimal real name of the user to be excavated;
Specifically, the server can weigh rule to each candidate according to preset tune if S203 is judged as NO
The corresponding posterior probability of real name is modified, and using the candidate real name of maximum revised posterior probability as it is described to
Excavate the optimal real name of user.It is described to adjust mapping relations of the frequency of occurrence with corrected parameter, the phase for weighing that rule includes: candidate real name
With the weight and the mapping relations of corrected parameter of full pinyin, the mapping relations of the character complexity of candidate real name and corrected parameter,
In the mapping relations of the mapping relations of the character length of candidate real name and corrected parameter, the popularity of surname and corrected parameter
At least one mapping relations.The frequency of occurrence of candidate's real name and the mapping relations of corrected parameter refer to multiple and different appearance
Mapping relations between frequency range and multiple and different corrected parameters, the corresponding bigger amendment ginseng of bigger frequency of occurrence range
Number, and be then negative for the corresponding corrected parameter of frequency of occurrence range lower than frequency threshold value, the appearance frequency of such as candidate real name A
frequency of occurrence is higher than that of candidate real name B, then the correction parameter corresponding to candidate real name A is larger, i.e. the posterior probability corresponding to candidate real name A is increased by a larger amount; as another example, if the frequency of occurrence of candidate real name C is lower than the frequency threshold, the posterior probability corresponding to candidate real name C needs to be reduced. The mapping relationship between the weight of an identical full pinyin and the correction parameter refers to the mapping between multiple different weight ranges and multiple different correction parameters: a larger weight range corresponds to a larger correction parameter, and the correction parameter corresponding to a weight range below the weight threshold may be negative. For example, the larger the share of the total number of pieces of user remark information taken up by a given identical full pinyin, the larger the weight of that identical full pinyin and the larger its correction parameter, so that the posterior probabilities of the candidate real names sharing that identical full pinyin can be raised. The mapping relationship between the character complexity of a candidate real name and the correction parameter refers to the mapping between multiple different character complexities and multiple different correction parameters: greater character complexity corresponds to a larger correction parameter. If a candidate real name contains Chinese characters that are difficult to write and uncommon (i.e. greater character complexity), that candidate real name corresponds to a larger correction parameter, so its posterior probability can be raised substantially. The mapping relationship between the character length of a candidate real name and the correction parameter refers to the mapping between multiple different character lengths and multiple different correction parameters: a longer character length corresponds to a larger correction parameter. For example, if the character length of candidate real name A is greater than that of candidate real name B, candidate real name A corresponds to a larger correction parameter, i.e. its posterior probability is raised by a larger amount. The mapping relationship between the popularity of a surname and the correction parameter refers to the mapping between multiple different surname popularities and multiple different correction parameters: the more common the surname, the larger its correction parameter, and the correction parameter for a surname below the popularity threshold may be negative; for example, the correction parameter for the surname "Wang" is larger than that for the surname "Ouyang". Therefore, the server can modify the posterior probability corresponding to each candidate real name according to one mapping relationship, or a combination of several mapping relationships, in the weight-adjustment rule (the modification may increase or decrease a posterior probability), and take the candidate real name with the largest revised posterior probability as the optimal real name of the user to be mined.
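The correction step described above can be sketched as follows. This is a minimal illustration, assuming each mapping relationship has already been reduced to a numeric correction parameter that is added to the posterior probability; the function name `apply_corrections` and all numeric values are hypothetical and do not come from the embodiment.

```python
def apply_corrections(posteriors, corrections):
    """Add every correction parameter to the matching posterior, then rank.

    posteriors:  {candidate_real_name: posterior_probability}
    corrections: {candidate_real_name: [correction parameters, may be negative]}
    Both structures are assumptions made for this sketch.
    """
    revised = {
        name: p + sum(corrections.get(name, []))
        for name, p in posteriors.items()
    }
    # The candidate with the largest revised posterior becomes the optimal real name.
    best = max(revised, key=revised.get)
    return best, revised


# Hypothetical values: A is penalised (for example, its frequency of occurrence is
# below the frequency threshold), B is rewarded (for example, a common surname).
posteriors = {"A": 0.45, "B": 0.40}
corrections = {"A": [-0.10], "B": [0.05]}
best, revised = apply_corrections(posteriors, corrections)
```

A candidate with no entry in `corrections` keeps its original posterior probability, which matches the case where none of the mapping relationships applies to it.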
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical candidate user remark information among the at least one piece of candidate user remark information is taken as a candidate real name; the frequency of occurrence of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information, and the posterior probability corresponding to each candidate real name is calculated from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name. When the maximum posterior probability is greater than the preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, so that the real name of a user can be accurately inferred from user remark information even when the user has not provided a real name, and the various functions of the social network can then be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the posterior probability corresponding to each candidate real name can further be modified according to the preset weight-adjustment rule, and the candidate real name with the largest revised posterior probability is taken as the optimal real name of the user to be mined, which further improves the accuracy of real-name identification.
Referring to Fig. 3, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention, the method may include:
S301: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
S302: count the frequency of occurrence of each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculate the posterior probability corresponding to each candidate real name from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;
For the specific implementation of steps S301 to S302, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.
S303: judge whether the maximum posterior probability is greater than the preset probability threshold;
Specifically, after calculating the posterior probability corresponding to each candidate real name, the server can further judge whether the maximum posterior probability is greater than the preset probability threshold.
S304: take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
Specifically, if the judgment in S303 is yes, the maximum posterior probability is sufficiently credible, so the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, to ensure that the optimal real name is the true real name of the user to be mined.
S305: calculate the ranking weight value corresponding to each candidate real name according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined;
Specifically, if the judgment in S303 is no, the server can obtain the remark attributes of the user corresponding to each piece of candidate user remark information (i.e. the user who wrote the remark for the user to be mined). The remark attributes of a user include the number of pieces of user remark information in which that user remarked a friend with a real name, and the total number of pieces of user remark information that user has written for friends. The server then calculates, from the remark attributes, the user real-name remark habit value corresponding to each piece of candidate user remark information, where the user real-name remark habit value is the ratio of the number of pieces of user remark information in which the user remarked a friend with a real name to the number of all pieces of user remark information the user has written for friends. For example, suppose the user corresponding to a piece of candidate user remark information (the user who remarked the user to be mined) is user A. If user A has written 100 pieces of user remark information for other people, and 70 of these 100 pieces are true real names, then the user real-name remark habit value of user A is 70/100. After calculating the user real-name remark habit value corresponding to each piece of candidate user remark information, the server can calculate the ranking weight value corresponding to each candidate real name from these habit values and the posterior probability corresponding to each candidate real name, and take the candidate real name with the largest ranking weight value as the optimal real name of the user to be mined.
The specific process of calculating the ranking weight value corresponding to each candidate real name, according to the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, may be as follows. Taking one candidate real name A as an example, the server can determine the multiple pieces of candidate user remark information corresponding to candidate real name A (the content of each of these pieces is candidate real name A) as multiple pieces of target candidate user remark information, and then calculate the average of the user real-name remark habit values corresponding to these pieces of target candidate user remark information. The average is then added to the posterior probability corresponding to candidate real name A to obtain the corresponding ranking weight value; alternatively, a certain coefficient may be added to the average and the result multiplied by the posterior probability corresponding to candidate real name A to obtain the corresponding ranking weight value. The ranking weight values of the other candidate real names are calculated on the same principle.
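The habit value and the two ways of combining it with the posterior probability can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the handling of a user with no remarks at all, and the coefficient value used in the usage lines are hypothetical; the embodiment only says "a certain coefficient".

```python
from statistics import mean


def habit_value(real_name_remarks, total_remarks):
    """Share of a user's remarks that are real names (70 of 100 gives 0.7)."""
    # Returning 0.0 for a user with no remarks is an assumption of this sketch.
    return real_name_remarks / total_remarks if total_remarks else 0.0


def ranking_weight(habit_values, posterior, coefficient=None):
    """Combine the average habit value of the remarking users with a posterior.

    coefficient=None      -> average + posterior
    coefficient=<number>  -> (average + coefficient) * posterior
    """
    average = mean(habit_values)
    if coefficient is None:
        return average + posterior
    return (average + coefficient) * posterior


# Two users remarked the user to be mined with candidate real name A.
habits_for_a = [habit_value(70, 100), habit_value(50, 100)]
additive = ranking_weight(habits_for_a, posterior=0.4)
scaled = ranking_weight(habits_for_a, posterior=0.4, coefficient=0.5)
```

The same function would be applied to every candidate real name, and the candidate with the largest result taken as the optimal real name.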
Optionally, if the largest revised posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the ranking weight values corresponding to the revised posterior probabilities can be calculated using the calculation principle of S305, so as to determine the optimal real name more accurately.
Optionally, if the largest ranking weight value calculated in S305 is still less than the preset probability threshold, the ranking weight values can be modified using the calculation principle of S205 in the embodiment corresponding to Fig. 2 above, so as to determine the optimal real name more accurately.
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical candidate user remark information among the at least one piece of candidate user remark information is taken as a candidate real name; the frequency of occurrence of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information, and the posterior probability corresponding to each candidate real name is calculated from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name. When the maximum posterior probability is greater than the preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, so that the real name of a user can be accurately inferred from user remark information even when the user has not provided a real name, and the various functions of the social network can then be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the ranking weight value corresponding to each candidate real name can further be calculated from the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and the candidate real name with the largest ranking weight value is taken as the optimal real name of the user to be mined, which further improves the accuracy of real-name identification.
Referring to Fig. 4, which is a schematic flowchart of yet another data mining processing method provided by an embodiment of the present invention, the method may include:
S401: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
S402: count the frequency of occurrence of each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculate the posterior probability corresponding to each candidate real name from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;
For the specific implementation of steps S401 to S402, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.
S403: judge whether the maximum posterior probability is greater than the preset probability threshold;
Specifically, after calculating the posterior probability corresponding to each candidate real name, the server can further judge whether the maximum posterior probability is greater than the preset probability threshold.
S404: take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
Specifically, if the judgment in S403 is yes, the maximum posterior probability is sufficiently credible, so the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, to ensure that the optimal real name is the true real name of the user to be mined.
S405: select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset ranking (rank) model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined;
Specifically, if the judgment in S403 is no, the server can select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the two candidate real names according to the extracted features and the preset rank model, and take the higher-scoring candidate real name as the optimal real name of the user to be mined. The rank model may be a pairwise rank model. The features may include: the total character length of the user remark information before word segmentation that corresponds to the candidate user remark information, the character length before the name, the character length after the name, the total character length of the candidate user remark information, the user real-name remark habit value of the user to be mined, and the user real-name remark habit value of the user corresponding to the candidate user remark information (the user who remarked the user to be mined).
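The feature list above can be turned into a feature record roughly as follows. This is a sketch only: the function name `extract_features`, the dictionary keys, and the use of a plain substring search to locate the name inside the unsegmented remark are assumptions, and the habit values are passed in rather than computed.

```python
def extract_features(raw_remark, candidate, mined_user_habit, remarker_habit):
    """Build one feature record for a candidate user remark.

    raw_remark:       the remark before word segmentation, e.g. "he is Wang AB"
    candidate:        the candidate user remark (the name left after segmentation)
    mined_user_habit: real-name remark habit value of the user to be mined
    remarker_habit:   real-name remark habit value of the user who wrote the remark
    """
    start = raw_remark.find(candidate)
    if start < 0:
        raise ValueError("candidate does not occur in the raw remark")
    return {
        "raw_length": len(raw_remark),
        "chars_before_name": start,
        "chars_after_name": len(raw_remark) - start - len(candidate),
        "candidate_length": len(candidate),
        "mined_user_habit": mined_user_habit,
        "remarker_habit": remarker_habit,
    }


# The surname is romanized as "Wang" in this illustration.
features = extract_features("he is Wang AB", "Wang AB", 0.6, 0.7)
```

Records of this shape, one per selected candidate user remark, would then be passed to whatever pairwise ranking implementation is chosen.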
Before the rank model is used for scoring, it needs to be built and trained. The specific process of building and training the rank model may be as follows: obtain multiple pieces of training user remark information, corresponding to users whose real names are known, for training the rank model, and take identical training user remark information among the multiple pieces of training user remark information as a training candidate real name; take each piece of training user remark information corresponding to a training candidate real name that is the user's real name as a first support set, the first support set corresponding to a first scoring value; take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the same full pinyin as the user's real name as a second support set, the second support set corresponding to a second scoring value, where the first scoring value is greater than the second scoring value; extract the features of the first support set and the features of the second support set, and build and train the rank model according to the features of the first support set and the first scoring value, and the features of the second support set and the second scoring value. Accordingly, the process of scoring the candidate real names with the largest and second-largest posterior probabilities based on the rank model may be: according to the scoring values of the support sets (the first support set or the second support set) to which the multiple pieces of candidate user remark information corresponding to the two input candidate real names belong, calculate the final score of each of the two candidate real names.
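The construction of the two support sets can be sketched as below. This is an illustration only: the pinyin of each remark is supplied through a hypothetical lookup table `pinyin_of` rather than computed, the names `N1` to `N3` are placeholders, and the scoring values 1.0 and 0.0 are arbitrary choices that merely satisfy the requirement that the first scoring value exceed the second. Training the pairwise model itself is outside the scope of the sketch.

```python
def build_support_sets(training_remarks, real_name, pinyin_of,
                       first_score=1.0, second_score=0.0):
    """Split training remarks into the two labelled support sets.

    First set:  remarks equal to the known real name.
    Second set: remarks that differ from the real name but share its full pinyin.
    Remarks with a different full pinyin belong to neither set.
    """
    if first_score <= second_score:
        raise ValueError("first scoring value must exceed the second")
    real_pinyin = pinyin_of[real_name]
    first_set, second_set = [], []
    for remark in training_remarks:
        if remark == real_name:
            first_set.append((remark, first_score))
        elif pinyin_of.get(remark) == real_pinyin:
            second_set.append((remark, second_score))
    return first_set, second_set


# Placeholder names: N1 is the known real name, N2 is a homophone of N1.
pinyin_of = {"N1": "wu xiaobo", "N2": "wu xiaobo", "N3": "zhang haibo"}
first_set, second_set = build_support_sets(["N1", "N1", "N2", "N3"], "N1", pinyin_of)
```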
Optionally, if the largest revised posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest revised posterior probabilities.
Optionally, if the largest ranking weight value calculated in S305 of the embodiment corresponding to Fig. 3 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest ranking weight values.
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical candidate user remark information among the at least one piece of candidate user remark information is taken as a candidate real name; the frequency of occurrence of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information, and the posterior probability corresponding to each candidate real name is calculated from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name. When the maximum posterior probability is greater than the preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined, so that the real name of a user can be accurately inferred from user remark information even when the user has not provided a real name, and the various functions of the social network can then be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the optimal real name can further be selected, based on the rank model, from the candidate real names with the largest and second-largest posterior probabilities, which further improves the accuracy of real-name identification.
Referring to Fig. 5, which is a schematic structural diagram of a data mining processing device provided by an embodiment of the present invention, the data mining processing device 1 can be applied in a server based on a social network, and may include: an acquisition and mining module 10, a calculation module 20 and a determination module 30;
The acquisition and mining module 10 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
Specifically, the acquisition and mining module 10 can obtain multiple pieces of user remark information corresponding to the user to be mined, where the user to be mined is a user whose true real name the server needs to identify by analysis, and the multiple pieces of user remark information are the remarks that friend users have written for the user to be mined. For example, if the user to be mined has 100 friend users and 75 of them have written a remark for the user to be mined, the remarks written by these 75 friends can be taken as the multiple pieces of user remark information. The acquisition and mining module 10 further mines at least one piece of candidate user remark information from the multiple pieces of user remark information, and takes identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name. For example, if the at least one piece of candidate user remark information contains 20 pieces that read "Wang AB", 3 pieces that read "Huang AC", 15 pieces that read "Huang AB" and 30 pieces that read "Wang AC", then "Wang AB", "Huang AC", "Huang AB" and "Wang AC" can be taken as the candidate real names.
Further, referring also to Fig. 6, which is a schematic structural diagram of an acquisition and mining module 10 provided by an embodiment of the present invention, the acquisition and mining module 10 may include: an acquisition and screening unit 101 and a deletion and determination unit 102;
The acquisition and screening unit 101 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, and to screen out, from the multiple pieces of user remark information, first-class user remark information that meets the surname condition according to a name structure rule and a preset surname matching table;
Specifically, the name structure rule may refer to the number of characters in a normal name; for example, a normal name generally has 2 to 4 Chinese characters (a name with a single-character surname has 2 to 3 Chinese characters, and a name with a compound surname has 3 to 4 Chinese characters). Therefore, the acquisition and screening unit 101 can first segment the obtained pieces of user remark information using an effective word segmentation algorithm (if a piece of user remark information is "he is Wang AB", the segmented user remark information becomes "Wang AB"), and then screen out the segmented user remark information containing 2 to 4 Chinese characters to obtain preliminarily screened user remark information. Next, the preliminarily screened user remark information containing 2 characters is matched against the single-character surname set in the preset surname matching table, to detect whether its first Chinese character exists in the single-character surname set; if so, that piece of user remark information is determined to meet the surname condition and is taken as first-class user remark information, otherwise it is rejected. At the same time, the acquisition and screening unit 101 matches the preliminarily screened user remark information containing 4 characters against the compound surname set in the preset surname matching table, to detect whether its first two Chinese characters exist in the compound surname set; if so, that piece of user remark information is determined to meet the surname condition and is taken as first-class user remark information, otherwise it is rejected. The acquisition and screening unit 101 also matches the preliminarily screened user remark information containing 3 characters against both the single-character surname set and the compound surname set, to detect whether its first Chinese character exists in the single-character surname set or its first two Chinese characters exist in the compound surname set; if either condition is met, that piece of user remark information is determined to meet the surname condition and is taken as first-class user remark information, and it is rejected only if neither condition is met.
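The surname condition can be sketched as a small predicate. This is a minimal illustration, assuming the input has already been segmented and consists only of Chinese characters; the two surname sets are tiny hypothetical stand-ins for the preset surname matching table, and the function name is not from the embodiment.

```python
# Illustrative subsets only; a real surname matching table would be far larger.
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}   # Wang, Huang, Zhang, Wu
COMPOUND_SURNAMES = {"欧阳"}                 # Ouyang


def meets_surname_condition(name):
    """Apply the 2/3/4-character surname rules to a segmented remark."""
    length = len(name)
    if length == 2:
        # Two characters: only a single-character surname is possible.
        return name[0] in SINGLE_SURNAMES
    if length == 3:
        # Three characters: either kind of surname qualifies.
        return name[0] in SINGLE_SURNAMES or name[:2] in COMPOUND_SURNAMES
    if length == 4:
        # Four characters: only a compound surname is possible.
        return name[:2] in COMPOUND_SURNAMES
    return False  # outside the 2-to-4 character name structure rule


remarks = ["王某", "欧阳某某", "欧阳某", "某某某", "王某某某"]
first_class = [r for r in remarks if meets_surname_condition(r)]
```

Here "某" is used as a placeholder character for an unspecified given name.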
The deletion and determination unit 102 is configured to delete, from the first-class user remark information, the user remark information that contains proper nouns and/or high-frequency words, determine the first-class user remark information remaining after deletion as the at least one piece of candidate user remark information, and take identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
The proper nouns may include role-specific words such as teacher, master, sir and Miss, and the high-frequency words may include frequently occurring words such as tomorrow, the day after tomorrow, eat and drink water. For example, if a piece of first-class user remark information is "Teacher Wang", the deletion and determination unit 102 can determine that it contains a proper noun and can therefore delete it.
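The deletion step can be sketched as a simple filter. This is an illustration under stated assumptions: the Chinese word lists are assumed renderings of the examples given above (teacher, master, sir, Miss; tomorrow, the day after tomorrow, eat, drink water), and plain substring matching is a simplification that could wrongly discard a genuine name containing one of these words.

```python
# Assumed Chinese renderings of the example words; not taken from the embodiment.
ROLE_WORDS = {"老师", "师傅", "先生", "小姐"}            # teacher, master, sir, Miss
HIGH_FREQUENCY_WORDS = {"明天", "后天", "吃饭", "喝水"}  # tomorrow, day after tomorrow, eat, drink water


def remove_non_names(first_class_remarks):
    """Drop remarks that contain a role word or a high-frequency word."""
    blocked = ROLE_WORDS | HIGH_FREQUENCY_WORDS
    return [
        remark for remark in first_class_remarks
        if not any(word in remark for word in blocked)
    ]


# "王老师" (Teacher Wang) passes the surname condition but is not a real name.
candidates = remove_non_names(["王老师", "王某某", "明天见"])
```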
The calculation module 20 is configured to count the frequency of occurrence of each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and to calculate the posterior probability corresponding to each candidate real name from the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;
Specifically, referring also to Fig. 7, which is a schematic structural diagram of a calculation module 20 provided by an embodiment of the present invention, the calculation module 20 may include: a pinyin acquisition unit 201, a frequency statistics unit 202, a first probability calculation unit 203 and a second probability calculation unit 204;
The pinyin acquisition unit 201 is configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;
Specifically, the pinyin acquisition unit 201 can obtain the full pinyin corresponding to each piece of candidate user remark information among the at least one piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo".
The frequency statistics unit 202 is configured to count, according to each piece of candidate user remark information, the frequency of occurrence of each identical surname pinyin and the frequency of occurrence of each identical given-name pinyin;
The first probability calculation unit 203 is configured to calculate the joint probability corresponding to each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical given-name pinyin, and the total number of pieces of candidate user remark information;
The second probability calculation unit 204 is configured to calculate the posterior probability corresponding to each candidate real name according to the frequency of occurrence of the identical full pinyin with the maximum joint probability and the frequency of occurrence of each candidate real name;
The posterior probability is calculated as: $P(\text{candidate real name} \mid \text{best full pinyin}) = \dfrac{\text{frequency of occurrence of the candidate real name within the best full pinyin}}{\text{frequency of occurrence of the best full pinyin}}$, where the best full pinyin is the identical full pinyin with the maximum joint probability. If the full pinyin of a candidate real name is not the best full pinyin, the frequency of occurrence of that candidate real name within the best full pinyin is 0.
Further, referring also to Fig. 8, which is a schematic structural diagram of a first probability calculation unit 203 provided by an embodiment of the present invention, the first probability calculation unit 203 may include: a first probability calculation subunit 2031, a second probability calculation subunit 2032 and a joint probability calculation subunit 2033;
The first probability calculation subunit 2031 is configured to calculate the first probability corresponding to each identical surname pinyin according to the frequency of occurrence of each identical surname pinyin and the total number of pieces of candidate user remark information;
The second probability calculation subunit 2032 is configured to calculate the second probability corresponding to each identical given-name pinyin according to the frequency of occurrence of each identical given-name pinyin and the total number of pieces of candidate user remark information;
The joint probability calculation subunit 2033 is configured to perform a calculation on each first probability and each second probability to obtain the joint probability corresponding to each identical full pinyin;
The joint probability is calculated as: $P_{\text{full pinyin}} = P_{\text{surname pinyin}} \times P_{\text{given-name pinyin}}$, where $P_{\text{surname pinyin}}$ is the first probability and $P_{\text{given-name pinyin}}$ is the second probability.
For example, at least one described candidate user remark information includes 30 " Wu Xiaobo ", 20 " Wu little Bo ", 10
" Wu Xiaobo ", 10 " Zhang Xiaobo " and 30 " Zhang Haibo ", wherein identical full pinyin includes " wu xiaobo ", " zhang
Xiaobo ", " zhang haibo ", then the first probability calculation subelement 2031 can calculate identical surname phonetic " wu "
PSurname phoneticThe frequency of occurrence of=" wu "/candidate user remark information total amount=60/100, the first probability calculation subelement
2031 calculate the P of identical surname phonetic " zhang "Surname phoneticThe frequency of occurrence of=" zhang "/candidate user remark information total amount
= 40/100. The second probability calculation subunit 2032 can calculate P_name of the identical name pinyin "xiaobo" = frequency of occurrence of "xiaobo" / total number of candidate user remark information = 70/100, and P_name of the identical name pinyin "haibo" = frequency of occurrence of "haibo" / total number of candidate user remark information = 30/100. The joint probability calculation subunit 2033 can then calculate the joint probability of each identical full pinyin: P_full("wu xiaobo") = P_surname("wu") * P_name("xiaobo") = 42/100; P_full("zhang xiaobo") = P_surname("zhang") * P_name("xiaobo") = 28/100; P_full("zhang haibo") = P_surname("zhang") * P_name("haibo") = 12/100. The identical full pinyin "wu xiaobo" has the maximum joint probability and is therefore used as the best full pinyin. The second probability calculation unit 204 may further calculate the posterior probability of each candidate real name given the best full pinyin "wu xiaobo": P("Wu Xiaobo" | "wu xiaobo") = 30/60; P("Wu little Bo" | "wu xiaobo") = 20/60; P("Wu Xiaobo" | "wu xiaobo") = 10/60 for a third candidate that shares the same pinyin but is written with different characters; P("Zhang Xiaobo" | "wu xiaobo") = 0; and P("Zhang Haibo" | "wu xiaobo") = 0.
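The arithmetic of this example can be checked with a short sketch. It is only an illustration under stated assumptions: the labels `candidate_A` to `candidate_E` are hypothetical stand-ins for the five candidate real names, the surname count of 60 for "wu" is inferred from the stated joint probability 42/100, and the counts for `candidate_D` and `candidate_E` are inferred so that the surname and name totals add up.

```python
from fractions import Fraction

TOTAL = 100  # total number of candidate user remark information in the example
surname_counts = {"wu": 60, "zhang": 40}   # "wu" = 60 is inferred from 42/100
name_counts = {"xiaobo": 70, "haibo": 30}
observed_full_pinyin = [("wu", "xiaobo"), ("zhang", "xiaobo"), ("zhang", "haibo")]

# Hypothetical labels -> (full pinyin, frequency of occurrence).
candidates = {
    "candidate_A": ("wu xiaobo", 30),
    "candidate_B": ("wu xiaobo", 20),
    "candidate_C": ("wu xiaobo", 10),
    "candidate_D": ("zhang xiaobo", 10),  # inferred count
    "candidate_E": ("zhang haibo", 30),   # inferred count
}

def joint_probabilities(pairs, surname_counts, name_counts, total):
    """P_full = P_surname * P_name for each observed full pinyin."""
    return {
        f"{s} {n}": Fraction(surname_counts[s], total) * Fraction(name_counts[n], total)
        for s, n in pairs
    }

def posteriors(candidates, best):
    """Share of each candidate among the remarks carrying the best full pinyin."""
    best_total = sum(count for pinyin, count in candidates.values() if pinyin == best)
    return {
        name: Fraction(count, best_total) if pinyin == best else Fraction(0)
        for name, (pinyin, count) in candidates.items()
    }

joint = joint_probabilities(observed_full_pinyin, surname_counts, name_counts, TOTAL)
best_full_pinyin = max(joint, key=joint.get)
posterior = posteriors(candidates, best_full_pinyin)
optimal = max(posterior, key=posterior.get)
```

Exact fractions are used so that the results can be compared directly with the values in the text, such as 42/100 and 30/60.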
The determining module 30 is configured to use the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
Specifically, after the posterior probability of each candidate real name has been calculated, the determining module 30 can use the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated; that is, the optimal real name is determined to be the true real name of the user to be excavated, so that the user's real name is identified accurately. For example, the candidate real names include "Wu Xiaobo", "Wu little Bo", "Wu Xiaobo", "Zhang Xiaobo" and "Zhang Haibo", whose posterior probabilities are 30/60, 20/60, 10/60, 0 and 0 respectively; the determining module 30 can then determine "Wu Xiaobo", which has the maximum posterior probability of 30/60, as the optimal real name of the user to be excavated.
Further, referring also to Fig. 9, which is a schematic structural diagram of the determining module 30 provided in an embodiment of the present invention, the determining module 30 may include: a first judging unit 301, a first determination unit 302, an amendment determination unit 303, a second judging unit 304, a second determination unit 305, a weight calculation determination unit 306, a third judging unit 307, a third determination unit 308 and a model score determination unit 309;
The first judging unit 301 is configured to judge whether the maximum posterior probability is greater than a preset probability threshold;
The first determination unit 302 is configured to, if the first judging unit 301 judges YES, use the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
The amendment determination unit 303 is configured to, if the first judging unit 301 judges NO, correct the posterior probability of each candidate real name according to a preset weight-adjustment rule, and use the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be excavated;
Wherein, the weight-adjustment rule includes at least one of the following mapping relations: the mapping relation between the frequency of occurrence of a candidate real name and a correction parameter, the mapping relation between the weight of an identical full pinyin and a correction parameter, the mapping relation between the character complexity of a candidate real name and a correction parameter, the mapping relation between the character length of a candidate real name and a correction parameter, and the mapping relation between the popularity of a surname and a correction parameter. For the specific implementation of the first judging unit 301, the first determination unit 302 and the amendment determination unit 303, refer to S201-S205 in the embodiment corresponding to Fig. 2 above; details are not repeated here.
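The threshold check followed by a correction step can be sketched as below. This is a minimal sketch under assumptions: the text does not say how a correction parameter is combined with a posterior probability, so multiplication is assumed here, and the function name `pick_with_correction`, the threshold value and the correction table are all hypothetical.

```python
def pick_with_correction(posteriors, threshold, correction):
    """Return the optimal candidate, correcting posteriors only when needed.

    posteriors: dict mapping candidate real name -> posterior probability
    threshold:  preset probability threshold
    correction: function mapping a candidate real name to a correction
                parameter (assumed here to act as a multiplicative factor)
    """
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] > threshold:
        return best  # judged YES: the maximum posterior is trusted as is
    # Judged NO: apply the weight-adjustment rule to every candidate.
    corrected = {name: p * correction(name) for name, p in posteriors.items()}
    return max(corrected, key=corrected.get)

# Hypothetical correction parameters, e.g. looked up from character complexity.
correction_table = {"candidate_A": 1.0, "candidate_B": 1.6, "candidate_C": 1.0}
example_posteriors = {"candidate_A": 30 / 60, "candidate_B": 20 / 60, "candidate_C": 10 / 60}

confident = pick_with_correction(example_posteriors, 0.4, correction_table.get)
adjusted = pick_with_correction(example_posteriors, 0.6, correction_table.get)
```

With the threshold at 0.4 the maximum posterior 0.5 is accepted directly; with the threshold at 0.6 the correction is applied and the hypothetical factor 1.6 moves `candidate_B` ahead.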
The second judging unit 304 is configured to judge whether the maximum posterior probability is greater than the preset probability threshold;
The second determination unit 305 is configured to, if the second judging unit 304 judges YES, use the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
The weight calculation determination unit 306 is configured to, if the second judging unit 304 judges NO, calculate a weight order value for each candidate real name according to the user remark real name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and use the candidate real name with the maximum weight order value as the optimal real name of the user to be excavated;
Wherein, the user remark real name habit value is the ratio of the number of remarks in which a user has remarked friends with their real names to the number of all remarks that the user has made for friends. For the specific implementation of the second judging unit 304, the second determination unit 305 and the weight calculation determination unit 306, refer to S301-S305 in the embodiment corresponding to Fig. 3 above; details are not repeated here.
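The habit value is a plain ratio, and one possible way to fold it into a weight order value is sketched below. The combination rule is an assumption: this passage does not give the formula, so the sketch multiplies each posterior probability by the average habit value of the users who wrote that candidate's remarks. The names `habit_value`, `weight_order_values` and the sample data are hypothetical.

```python
def habit_value(real_name_remarks, total_remarks):
    """Ratio of a user's real-name remarks to all remarks that user made."""
    return real_name_remarks / total_remarks if total_remarks else 0.0

def weight_order_values(posteriors, remarkers_by_candidate, habit):
    """Assumed rule: posterior * mean habit value of the remarking users."""
    weights = {}
    for name, p in posteriors.items():
        users = remarkers_by_candidate.get(name, [])
        mean_habit = sum(habit[u] for u in users) / len(users) if users else 0.0
        weights[name] = p * mean_habit
    return weights

# Hypothetical data: u1 and u2 rarely remark friends by real name, u3 usually does.
habit = {"u1": habit_value(2, 10), "u2": habit_value(4, 10), "u3": habit_value(9, 10)}
sample_posteriors = {"candidate_A": 0.5, "candidate_B": 0.4}
remarkers = {"candidate_A": ["u1", "u2"], "candidate_B": ["u3"]}

weights = weight_order_values(sample_posteriors, remarkers, habit)
best_by_weight = max(weights, key=weights.get)
```

Under this assumed rule a candidate with a slightly lower posterior can still win when it comes from users who habitually remark friends by real name.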
The third judging unit 307 is configured to judge whether the maximum posterior probability is greater than the preset probability threshold;
The third determination unit 308 is configured to, if the third judging unit 307 judges YES, use the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
The model score determination unit 309 is configured to, if the third judging unit 307 judges NO, select the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model used for ranking, and use the candidate real name with the higher score as the optimal real name of the user to be excavated;
Wherein, for the specific implementation of the third judging unit 307, the third determination unit 308 and the model score determination unit 309, refer to S401-S405 in the embodiment corresponding to Fig. 4 above; details are not repeated here.
Optionally, while the first judging unit 301, the first determination unit 302 and the amendment determination unit 303 are performing their operations, the second judging unit 304, the second determination unit 305, the weight calculation determination unit 306, the third judging unit 307, the third determination unit 308 and the model score determination unit 309 all stop working. While the second judging unit 304, the second determination unit 305 and the weight calculation determination unit 306 are performing their operations, the first judging unit 301, the first determination unit 302, the amendment determination unit 303, the third judging unit 307, the third determination unit 308 and the model score determination unit 309 all stop working. While the third judging unit 307, the third determination unit 308 and the model score determination unit 309 are performing their operations, the first judging unit 301, the first determination unit 302, the amendment determination unit 303, the second judging unit 304, the second determination unit 305 and the weight calculation determination unit 306 all stop working. The first judging unit 301, the second judging unit 304 and the third judging unit 307 may be one and the same judging unit; the first determination unit 302, the second determination unit 305 and the third determination unit 308 may be one and the same determination unit.
In the embodiment of the present invention, at least one piece of candidate user remark information is mined and analyzed from multiple pieces of user remark information, and identical candidate user remark information among the at least one piece of candidate user remark information is used respectively as candidate real names; the frequency of occurrence of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information; the posterior probability of each candidate real name is calculated according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name; and finally the candidate real name with the maximum posterior probability is used as the optimal real name of the user to be excavated. In this way, the user's real name can be analyzed accurately from user remark information even when the user has not provided a real name, and the functions of the social network can be enriched on the basis of the analyzed real name.
Referring to Fig. 10, which is a schematic structural diagram of another data mining processing device 1 provided in an embodiment of the present invention, the data mining processing device 1 can be applied in a server based on a social network. The data mining processing device 1 may include the acquisition and mining module 10, the computing module 20 and the determining module 30 of the embodiment corresponding to Fig. 5 above. Further, the data mining processing device 1 may also include: an acquisition determining module 40, a set determining module 50 and a model training module 60;
The acquisition determining module 40 is configured to obtain multiple pieces of training user remark information that correspond to a user whose real name is known and that are used to train the rank model, and to use identical training user remark information among the multiple pieces of training user remark information respectively as training candidate real names;
The set determining module 50 is configured to take each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set; the first support set corresponds to a first score value;
The set determining module 50 is further configured to take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;
The model training module 60 is configured to extract the features of the first support set and the features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;
Wherein, after the rank model has been established and trained by the acquisition determining module 40, the set determining module 50 and the model training module 60, the model score determination unit 309 in the embodiment corresponding to Fig. 9 above can calculate the final score of each of the two input candidate real names according to the score value, in the rank model, of the support set (the first support set or the second support set) to which the multiple pieces of candidate user remark information corresponding to the two candidate real names belong.
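The construction of the two support sets can be sketched as follows. This is only an illustration: the sample names, the `pinyin_of` lookup table and the score values 1.0 and 0.0 are hypothetical, and feature extraction and the rank model itself are left out because this passage does not specify them.

```python
def build_support_sets(training_remarks, real_name, pinyin_of,
                       first_score=1.0, second_score=0.0):
    """Split training remarks into the first and second support sets.

    first set:  remarks identical to the known real name
    second set: remarks that differ from the real name but share its full pinyin
    Returns two lists of (remark, score) pairs.
    """
    if first_score <= second_score:
        raise ValueError("the first score value must exceed the second score value")
    target_pinyin = pinyin_of[real_name]
    first = [(r, first_score) for r in training_remarks if r == real_name]
    second = [
        (r, second_score)
        for r in training_remarks
        if r != real_name and pinyin_of.get(r) == target_pinyin
    ]
    return first, second

# Hypothetical lookup table standing in for a real pinyin converter.
pinyin_of = {"吴晓波": "wu xiaobo", "吴小波": "wu xiaobo", "张海波": "zhang haibo"}
training_remarks = ["吴晓波", "吴小波", "吴晓波", "张海波", "波波"]

first_set, second_set = build_support_sets(training_remarks, "吴晓波", pinyin_of)
```

Remarks that neither equal the real name nor share its full pinyin, such as the last two samples, fall into neither set.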
Referring to Fig. 11, which is a schematic structural diagram of a server provided in an embodiment of the present invention, the server 1000 may include: at least one processor 1001 (such as a CPU), at least one network interface 1004, a user interface 1003, a memory 1005 and at least one communication bus 1002. The communication bus 1002 is used to implement connection and communication among these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or a non-volatile memory, for example at least one magnetic disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 11, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module and a device control application program.
In the server 1000 shown in Fig. 11, the network interface 1004 is mainly used to connect to a client and receive the user remark information sent by the client; the user interface 1003 is mainly used to provide an input interface for the user and to obtain the data output by the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005, so as to implement:
Obtaining multiple pieces of user remark information corresponding to a user to be excavated, mining and analyzing at least one piece of candidate user remark information from the multiple pieces of user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names;
Counting, according to the pinyin corresponding to each piece of candidate user remark information, the frequency of occurrence of each identical pinyin, and calculating the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name;
Using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated.
In one embodiment, when obtaining multiple pieces of user remark information corresponding to the user to be excavated, mining and analyzing at least one piece of candidate user remark information from the multiple pieces of user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names, the processor 1001 specifically performs:
Obtaining multiple pieces of user remark information corresponding to the user to be excavated, and screening out, according to a name tactical rule and a preset surname matching table, first-kind user remark information that meets the surname condition from the multiple pieces of user remark information;
Deleting, from the first-kind user remark information, the user remark information that contains proper nouns and/or high-frequency words, determining the first-kind user remark information remaining after the deletion as the at least one piece of candidate user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names.
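The screening and deletion steps can be sketched as below. Everything specific here is an assumption made for illustration: the surname matching table `SURNAMES`, the blocked word list `BLOCKED_WORDS`, the sample remarks, and the name rule (two to four characters beginning with a listed surname) are hypothetical, since this passage does not define them.

```python
from collections import Counter

SURNAMES = {"吴", "张"}                  # hypothetical surname matching table
BLOCKED_WORDS = {"老师", "经理", "公司"}  # hypothetical proper nouns / high-frequency words

def candidate_real_names(remarks):
    """Return each candidate real name with its frequency of occurrence."""
    # Step 1: keep remarks that look like a name and start with a known surname.
    first_kind = [r for r in remarks if 2 <= len(r) <= 4 and r[0] in SURNAMES]
    # Step 2: delete remarks that contain a proper noun or high-frequency word.
    kept = [r for r in first_kind if not any(w in r for w in BLOCKED_WORDS)]
    # Step 3: identical remaining remarks are grouped into one candidate real name.
    return Counter(kept)

remarks = ["吴晓波", "吴晓波", "吴小波", "张老师", "小波", "吴经理"]
candidates = candidate_real_names(remarks)
```

The resulting counts are the per-candidate frequencies of occurrence that the later probability steps consume.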
In one embodiment, when counting the frequency of occurrence of each identical pinyin according to the pinyin corresponding to each piece of candidate user remark information, and calculating the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name, the processor 1001 specifically performs:
Obtaining the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a name pinyin;
Counting, according to each piece of candidate user remark information, the frequency of occurrence of each identical surname pinyin and the frequency of occurrence of each identical name pinyin;
Calculating the joint probability of each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical name pinyin and the total number of candidate user remark information;
Calculating the posterior probability of each candidate real name according to the frequency of occurrence of the identical full pinyin with the maximum joint probability and the frequency of occurrence of each candidate real name.
In one embodiment, when calculating the joint probability of each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical name pinyin and the total number of candidate user remark information, the processor 1001 specifically performs:
Calculating a first probability for each identical surname pinyin according to the frequency of occurrence of that surname pinyin and the total number of candidate user remark information;
Calculating a second probability for each identical name pinyin according to the frequency of occurrence of that name pinyin and the total number of candidate user remark information;
Performing a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.
In one embodiment, when using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
If NO, correcting the posterior probability of each candidate real name according to a preset weight-adjustment rule, and using the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be excavated;
Wherein, the weight-adjustment rule includes at least one of the following mapping relations: the mapping relation between the frequency of occurrence of a candidate real name and a correction parameter, the mapping relation between the weight of an identical full pinyin and a correction parameter, the mapping relation between the character complexity of a candidate real name and a correction parameter, the mapping relation between the character length of a candidate real name and a correction parameter, and the mapping relation between the popularity of a surname and a correction parameter.
In one embodiment, when using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
If NO, calculating a weight order value for each candidate real name according to the user remark real name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and using the candidate real name with the maximum weight order value as the optimal real name of the user to be excavated;
Wherein, the user remark real name habit value is the ratio of the number of remarks in which a user has remarked friends with their real names to the number of all remarks that the user has made for friends.
In one embodiment, when using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
If NO, selecting the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, performing feature extraction on the selected candidate user remark information, scoring the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model used for ranking, and using the candidate real name with the higher score as the optimal real name of the user to be excavated.
In one embodiment, the processor 1001 also performs:
Obtaining multiple pieces of training user remark information that correspond to a user whose real name is known and that are used to train the rank model, and using identical training user remark information among the multiple pieces of training user remark information respectively as training candidate real names;
Taking each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set; the first support set corresponds to a first score value;
Taking each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set; the second support set corresponds to a second score value, and the first score value is greater than the second score value;
Extracting the features of the first support set and the features of the second support set, and establishing and training the rank model according to the features of the first support set and the first score value, and the features of the second support set and the second score value;
Wherein, the first support set and the second support set in the trained rank model are used to score the input candidate real names.
In the embodiment of the present invention, at least one piece of candidate user remark information is mined and analyzed from multiple pieces of user remark information, and identical candidate user remark information among the at least one piece of candidate user remark information is used respectively as candidate real names; the frequency of occurrence of each identical pinyin is counted according to the pinyin corresponding to each piece of candidate user remark information; the posterior probability of each candidate real name is calculated according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name; and finally the candidate real name with the maximum posterior probability is used as the optimal real name of the user to be excavated. In this way, the user's real name can be analyzed accurately from user remark information even when the user has not provided a real name, and the functions of the social network can be enriched on the basis of the analyzed real name.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above disclosure is only the preferred embodiments of the present invention and certainly cannot be used to limit the scope of the claims of the present invention; therefore, equivalent changes made in accordance with the claims of the present invention still fall within the scope of the present invention.
Claims (16)
1. A data mining processing method, characterized by comprising:
obtaining multiple pieces of user remark information corresponding to a user to be excavated, mining and analyzing at least one piece of candidate user remark information from the multiple pieces of user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names;
counting, according to the pinyin corresponding to each piece of candidate user remark information, the frequency of occurrence of each identical pinyin, and calculating the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name; wherein the posterior probability is calculated from the frequency of occurrence of the identical full pinyin having the maximum joint probability and the frequency of occurrence of each candidate real name; and the joint probability of each identical full pinyin is calculated from the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical name pinyin and the total number of candidate user remark information;
using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated.
2. The method according to claim 1, characterized in that the obtaining multiple pieces of user remark information corresponding to a user to be excavated, mining and analyzing at least one piece of candidate user remark information from the multiple pieces of user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names comprises:
obtaining multiple pieces of user remark information corresponding to the user to be excavated, and screening out, according to a name tactical rule and a preset surname matching table, first-kind user remark information that meets the surname condition from the multiple pieces of user remark information;
deleting, from the first-kind user remark information, the user remark information that contains proper nouns and/or high-frequency words, determining the first-kind user remark information remaining after the deletion as the at least one piece of candidate user remark information, and using identical candidate user remark information among the at least one piece of candidate user remark information respectively as candidate real names.
3. The method according to claim 1, characterized in that the counting, according to the pinyin corresponding to each piece of candidate user remark information, the frequency of occurrence of each identical pinyin, and calculating the posterior probability of each candidate real name according to the frequency of occurrence of each identical pinyin and the frequency of occurrence of each candidate real name comprises:
obtaining the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a name pinyin;
counting, according to each piece of candidate user remark information, the frequency of occurrence of each identical surname pinyin and the frequency of occurrence of each identical name pinyin;
calculating the joint probability of each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical name pinyin and the total number of candidate user remark information;
calculating the posterior probability of each candidate real name according to the frequency of occurrence of the identical full pinyin with the maximum joint probability and the frequency of occurrence of each candidate real name.
4. The method according to claim 3, characterized in that the calculating the joint probability of each identical full pinyin according to the frequency of occurrence of each identical surname pinyin, the frequency of occurrence of each identical name pinyin and the total number of candidate user remark information comprises:
calculating a first probability for each identical surname pinyin according to the frequency of occurrence of that surname pinyin and the total number of candidate user remark information;
calculating a second probability for each identical name pinyin according to the frequency of occurrence of that name pinyin and the total number of candidate user remark information;
performing a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.
5. The method according to claim 1, characterized in that the using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated comprises:
judging whether the maximum posterior probability is greater than a preset probability threshold;
if YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
if NO, correcting the posterior probability of each candidate real name according to a preset weight-adjustment rule, and using the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be excavated;
wherein the weight-adjustment rule includes at least one of the following mapping relations: the mapping relation between the frequency of occurrence of a candidate real name and a correction parameter, the mapping relation between the weight of an identical full pinyin and a correction parameter, the mapping relation between the character complexity of a candidate real name and a correction parameter, the mapping relation between the character length of a candidate real name and a correction parameter, and the mapping relation between the popularity of a surname and a correction parameter.
6. The method according to claim 1, characterized in that the using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated comprises:
judging whether the maximum posterior probability is greater than a preset probability threshold;
if YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
if NO, calculating a weight order value for each candidate real name according to the user remark real name habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and using the candidate real name with the maximum weight order value as the optimal real name of the user to be excavated;
wherein the user remark real name habit value is the ratio of the number of remarks in which a user has remarked friends with their real names to the number of all remarks that the user has made for friends.
7. The method according to claim 1, characterized in that the using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated comprises:
judging whether the maximum posterior probability is greater than a preset probability threshold;
if YES, using the candidate real name with the maximum posterior probability as the optimal real name of the user to be excavated;
if NO, selecting the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, performing feature extraction on the selected candidate user remark information, scoring the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model used for ranking, and using the candidate real name with the higher score as the optimal real name of the user to be excavated.
8. the method for claim 7, which is characterized in that further include:
The multiple training user's remark informations for being used to train rank model corresponding with the user of known users real name are obtained, and will
Identical training user's remark information is respectively as the candidate real name of training in the multiple training user's remark information;
It will be each training user's remark information corresponding to the candidate real name of the training of user's real name as the first support set;
Corresponding first scoring values are gathered in first support;
It will be for non-user's real name and each training corresponding to the candidate real name of training of the full pinyin with user's real name
User's remark information is as the second support set;Corresponding second scoring values, first goals for are gathered in second support
Value is greater than second scoring values;
The feature of the first support set and the feature of the second support set are extracted, and is gathered according to first support
Feature and first scoring values, it is described second support set feature and second scoring values establish and train
Rank model;
Wherein, the first support set in the rank model after training and the second support set are for being inputted
Candidate real name score.
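A minimal sketch of how the two support sets might be assembled from training data. The function name `build_support_sets`, the tuple layout, and the default score values 1.0 and 0.0 are assumptions; the claim only requires that the first score value exceed the second. Feature extraction and the rank model itself are not shown.

```python
def build_support_sets(training_remarks, real_name, real_pinyin,
                       first_score=1.0, second_score=0.0):
    """Split training remarks into the two support sets.

    training_remarks: list of (remark text, full pinyin) pairs written
                      about one user whose real name is known.
    Returns ((first set, first score), (second set, second score)).
    """
    if first_score <= second_score:
        raise ValueError("first score value must exceed second score value")
    # First support set: remarks that are exactly the user's real name.
    first = [text for text, _ in training_remarks if text == real_name]
    # Second support set: remarks that are not the real name but share its full pinyin.
    second = [text for text, pinyin in training_remarks
              if text != real_name and pinyin == real_pinyin]
    return (first, first_score), (second, second_score)
```

Remarks whose pinyin differs from that of the real name fall into neither set.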
9. A data mining processing device, comprising:
an acquisition and mining module, configured to obtain multiple pieces of user remark information corresponding to a user to be mined, to mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and to take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
a calculation module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence frequency of each identical pinyin, and to calculate the posterior probability of each candidate real name according to the occurrence frequency of each identical pinyin and the occurrence frequency of each candidate real name; the posterior probability is calculated from the occurrence frequency of the identical full pinyin having the maximum joint probability and the occurrence frequency of each candidate real name; the joint probability of each identical full pinyin is calculated from the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total number of pieces of candidate user remark information;
a determining module, configured to take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined.
10. The device of claim 9, wherein the acquisition and mining module comprises:
an acquisition and screening unit, configured to obtain multiple pieces of user remark information corresponding to the user to be mined, and to screen out, from the multiple pieces of user remark information, first-type user remark information that satisfies a surname condition according to a name structure rule and a preset surname matching table;
a deletion and determination unit, configured to delete, from the first-type user remark information, the user remark information containing proper nouns and/or high-frequency words, to determine the first-type user remark information remaining after the deletion as the at least one piece of candidate user remark information, and to take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name.
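The screening and deletion steps can be illustrated with a short sketch. Everything concrete here is an assumption made for illustration: the tiny `SURNAMES` table, the `STOP_TERMS` list standing in for proper nouns and high-frequency words, and a name structure rule of two to four Chinese characters beginning with a listed surname.

```python
import re
from collections import Counter

SURNAMES = {"张", "李", "王", "欧阳"}   # hypothetical surname matching table
STOP_TERMS = {"老师", "经理", "公司"}   # hypothetical proper nouns / high-frequency words
NAME_RE = re.compile(r"[\u4e00-\u9fff]{2,4}")  # assumed name structure rule

def candidate_real_names(remarks):
    """Return {candidate real name: occurrence frequency} for one user."""
    kept = []
    for remark in remarks:
        if not NAME_RE.fullmatch(remark):
            continue  # fails the assumed name structure rule
        if not any(remark.startswith(s) for s in SURNAMES):
            continue  # no match in the surname table
        if any(term in remark for term in STOP_TERMS):
            continue  # contains a proper noun or high-frequency word
        kept.append(remark)
    # Identical remaining remarks are grouped into one candidate real name.
    return Counter(kept)
```

Grouping with `Counter` also yields the occurrence frequency of each candidate real name, which the later probability calculation needs.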
11. The device of claim 9, wherein the calculation module comprises:
a pinyin acquisition unit, configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin comprising a surname pinyin and a given-name pinyin;
a frequency statistics unit, configured to count, according to each piece of candidate user remark information, the occurrence frequency of each identical surname pinyin and the occurrence frequency of each identical given-name pinyin;
a first probability calculation unit, configured to calculate the joint probability of each identical full pinyin according to the occurrence frequency of each identical surname pinyin, the occurrence frequency of each identical given-name pinyin, and the total number of pieces of candidate user remark information;
a second probability calculation unit, configured to calculate the posterior probability of each candidate real name according to the occurrence frequency of the identical full pinyin having the maximum joint probability and the occurrence frequency of each candidate real name.
12. The device of claim 11, wherein the first probability calculation unit comprises:
a first probability calculation subunit, configured to calculate a first probability for each identical surname pinyin according to the occurrence frequency of each identical surname pinyin and the total number of pieces of candidate user remark information;
a second probability calculation subunit, configured to calculate a second probability for each identical given-name pinyin according to the occurrence frequency of each identical given-name pinyin and the total number of pieces of candidate user remark information;
a joint probability calculation subunit, configured to perform a calculation on each first probability and each second probability to obtain the joint probability of each identical full pinyin.
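The probability chain described in claims 11 and 12 can be sketched as below. The claims name the inputs of each step but not the formulas, so two points are assumptions: the joint probability is taken as the product of the first and second probabilities, and the posterior probability is taken as the occurrence frequency of a candidate real name divided by the occurrence frequency of the full pinyin with the maximum joint probability. The function name and tuple layout are also hypothetical.

```python
from collections import Counter

def posterior_by_candidate(candidates):
    """candidates: list of (real name, surname pinyin, given-name pinyin),
    one entry per piece of candidate user remark information."""
    total = len(candidates)
    surname_counts = Counter(s for _, s, _ in candidates)
    given_counts = Counter(g for _, _, g in candidates)
    full_counts = Counter((s, g) for _, s, g in candidates)
    name_counts = Counter(candidates)

    # First probability x second probability (assumed to be a product).
    joint = {
        (s, g): (surname_counts[s] / total) * (given_counts[g] / total)
        for (s, g) in full_counts
    }
    best_full = max(joint, key=joint.get)

    # Posterior of each candidate real name that shares the best full pinyin
    # (assumed to be a ratio of occurrence frequencies).
    return {
        name: count / full_counts[best_full]
        for (name, s, g), count in name_counts.items()
        if (s, g) == best_full
    }
```

For three remarks "张伟", one "章伟" and one "李娜", the full pinyin "zhang wei" has the largest joint probability, and the posteriors of "张伟" and "章伟" come out as 0.75 and 0.25 under these assumptions.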
13. The device of claim 9, wherein the determining module comprises:
a first judging unit, configured to determine whether the maximum posterior probability is greater than a preset probability threshold;
a first determination unit, configured to take, if the first judging unit determines so, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a correction determination unit, configured to correct, if the first judging unit determines otherwise, the posterior probability of each candidate real name according to a preset weight-adjustment rule, and to take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;
wherein the weight-adjustment rule comprises at least one of the following mapping relations: a mapping between the occurrence frequency of a candidate real name and a correction parameter, a mapping between the weight of an identical full pinyin and a correction parameter, a mapping between the character complexity of a candidate real name and a correction parameter, a mapping between the character length of a candidate real name and a correction parameter, and a mapping between the popularity of a surname and a correction parameter.
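One way to picture the weight-adjustment rule is as a set of lookup mappings whose correction parameters rescale each posterior probability. The sketch below uses only two of the listed mappings (character length and occurrence frequency) and applies them multiplicatively; the multiplicative form, the function name, and all numeric values are assumptions, since the claim does not specify how a correction parameter is applied.

```python
def corrected_posteriors(posteriors, name_counts, length_factor, count_factor):
    """Apply correction parameters to each candidate's posterior probability.

    length_factor: {character length: correction parameter}
    count_factor:  callable mapping an occurrence frequency to a correction parameter
    """
    corrected = {}
    for name, p in posteriors.items():
        by_length = length_factor.get(len(name), 1.0)      # unmapped lengths are left unchanged
        by_count = count_factor(name_counts.get(name, 0))
        corrected[name] = p * by_length * by_count
    return corrected


# Example with hypothetical mappings: rarely seen names are halved,
# three-character names are slightly down-weighted.
posteriors = {"张伟": 0.5, "张伟伟": 0.5}
counts = {"张伟": 1, "张伟伟": 3}
result = corrected_posteriors(
    posteriors, counts,
    length_factor={2: 1.0, 3: 0.8},
    count_factor=lambda c: 1.0 if c >= 2 else 0.5,
)
best = max(result, key=result.get)
```

The candidate with the maximum corrected value is then taken as the optimal real name.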
14. The device of claim 9, wherein the determining module comprises:
a second judging unit, configured to determine whether the maximum posterior probability is greater than a preset probability threshold;
a second determination unit, configured to take, if the second judging unit determines so, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a weight calculation and determination unit, configured to calculate, if the second judging unit determines otherwise, a weight ranking value for each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and to take the candidate real name with the maximum weight ranking value as the optimal real name of the user to be mined;
wherein the user remark real-name habit value is the ratio of the number of pieces of user remark information in which a user remarks a friend with a real name to the number of all pieces of user remark information with which that user remarks friends.
15. The device of claim 9, wherein the determining module comprises:
a third judging unit, configured to determine whether the maximum posterior probability is greater than a preset probability threshold;
a third determination unit, configured to take, if the third judging unit determines so, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a model scoring and determination unit, configured to select, if the third judging unit determines otherwise, the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, to extract features from the selected candidate user remark information, to score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model, and to take the candidate real name with the higher score as the optimal real name of the user to be mined.
16. The device of claim 15, further comprising:
an acquisition and determination module, configured to obtain multiple pieces of training user remark information that correspond to a user whose real name is known and that are used to train the rank model, and to take each group of identical training user remark information among the multiple pieces of training user remark information as a training candidate real name;
a set determining module, configured to take each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, the first support set corresponding to a first score value;
the set determining module being further configured to take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set, the second support set corresponding to a second score value, the first score value being greater than the second score value;
a model training module, configured to extract features of the first support set and features of the second support set, and to establish and train the rank model according to the features of the first support set and the first score value and the features of the second support set and the second score value;
wherein the first support set and the second support set in the trained rank model are used to score an input candidate real name.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610387322.5A CN106021235B (en) | 2016-06-01 | 2016-06-01 | A kind of data mining processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610387322.5A CN106021235B (en) | 2016-06-01 | 2016-06-01 | A kind of data mining processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021235A CN106021235A (en) | 2016-10-12 |
CN106021235B true CN106021235B (en) | 2019-01-29 |
Family
ID=57089437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610387322.5A Active CN106021235B (en) | 2016-06-01 | 2016-06-01 | A kind of data mining processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021235B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107329672B (en) * | 2017-07-18 | 2020-01-14 | 携程旅游网络技术(上海)有限公司 | Method, system, device and storage medium for checking hyperlink through mouse track |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102004788A (en) * | 2010-12-07 | 2011-04-06 | 北京开心人信息技术有限公司 | Method and system for intelligently positioning linkman of social networking services |
CN104573076A (en) * | 2015-01-27 | 2015-04-29 | 南京烽火星空通信发展有限公司 | Social networking site user Chinese remark name system recommendation method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6364775B2 (en) * | 2014-01-09 | 2018-08-01 | サクサ株式会社 | Electronic conference system and program thereof |
Non-Patent Citations (2)
Title |
---|
Social Network Analysis on Name Disambiguation and More;Byung-Won On;《Third 2008 International Conference on Convergence and Hybrid Information Technology》;20081111;第1081-1088页 |
实名SNS社交网络与微博的特征分析;任蔷 等;《现代情报》;20130731;第33卷(第7期);第94-98页 |
Also Published As
Publication number | Publication date |
---|---|
CN106021235A (en) | 2016-10-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103914494B (en) | Method and system for identifying identity of microblog user | |
CN103076892B (en) | A kind of method and apparatus of the input candidate item for providing corresponding to input character string | |
CN103886034B (en) | A kind of method and apparatus of inquiry input information that establishing index and matching user | |
CN103198057B (en) | One kind adds tagged method and apparatus to document automatically | |
CN103853738B (en) | A kind of recognition methods of info web correlation region | |
CN108628971A (en) | File classification method, text classifier and the storage medium of imbalanced data sets | |
CN104504264B (en) | Visual human's method for building up and device | |
CN108897732A (en) | Statement type recognition methods and device, storage medium and electronic device | |
CN103282903A (en) | Topic extraction device and program | |
US20170351739A1 (en) | Method and apparatus for identifying timeliness-oriented demands, an apparatus and non-volatile computer storage medium | |
CN110458296B (en) | Method and device for marking target event, storage medium and electronic device | |
CN108255552A (en) | PUSH message method of reseptance, device, equipment and computer readable storage medium | |
CN110362601A (en) | Mapping method, device, equipment and the storage medium of metadata standard | |
Parmar et al. | Team performance indicators that predict match outcome and points difference in professional rugby league | |
CN103646074A (en) | Method and device for determining core words of description texts in picture clusters | |
CN114969326A (en) | Classification model training and semantic classification method, device, equipment and medium | |
CN107085568A (en) | A kind of text similarity method of discrimination and device | |
CN106021235B (en) | A kind of data mining processing method and device | |
CN105323763B (en) | A kind of recognition methods of junk short message and device | |
CN111950267B (en) | Text triplet extraction method and device, electronic equipment and storage medium | |
CN110457601A (en) | The recognition methods and device of social account, storage medium and electronic device | |
US11367311B2 (en) | Face recognition method and apparatus, server, and storage medium | |
CN113705164A (en) | Text processing method and device, computer equipment and readable storage medium | |
CN103246642A (en) | Information processing device and information processing method | |
JP5512737B2 (en) | Topic extraction apparatus and topic extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20240104. Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors. Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.; TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District. Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |