CN106021235A - Data mining processing method and device - Google Patents
- Publication number
- CN106021235A (application CN201610387322.5A)
- Authority
- CN
- China
- Prior art keywords
- user
- candidate
- real name
- remark information
- phonetic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/226—Validation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The embodiments of the invention disclose a data mining processing method and device. The method comprises: obtaining multiple pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from them, and taking each group of identical candidate user remark information as a candidate real name; counting, from the pinyin of each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name from those pinyin occurrence counts and the occurrence count of each candidate real name; and taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined. With this method and device, a user's real name can be accurately identified, which enriches the functions of a social network.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a data mining processing method and device.
Background
With the development of Internet technology, more and more users take part in social networks. Before joining a social network a user must first register, and the registered user name can be any characters the user chooses to enter; that is, the registration information need not contain the user's real name. Security monitoring in a social network, however, needs the user's real name to determine whether the user is fraudulent, and precise audience mining on a social network likewise needs users' real names. In current social networks, a real name can only be supplied by the user. When a user is unwilling to supply it, the server side of the social network cannot learn the user's real name, so some functions of the social network cannot be fully realized.
Summary of the invention
The embodiments of the present invention provide a data mining processing method and device that can accurately analyze and identify a user's real name, thereby enriching the functions of a social network.
An embodiment of the present invention provides a data mining processing method, including:
obtaining multiple pieces of user remark information corresponding to a user to be mined, mining at least one piece of candidate user remark information from the multiple pieces of user remark information, and taking each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
counting, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculating the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;
taking the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.
Correspondingly, an embodiment of the present invention further provides a data mining processing device, including:
an acquisition and mining module, configured to obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
a calculation module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;
a determination module, configured to take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined.
In the embodiments of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is then counted from the pinyin of each piece of candidate user remark information, the posterior probability of each candidate real name is calculated from those pinyin occurrence counts and the occurrence count of each candidate real name, and the candidate real name with the largest posterior probability is finally taken as the optimal real name of the user to be mined. The user's real name can thus be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data mining processing method according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another data mining processing method according to an embodiment of the present invention;
Fig. 3 is a schematic flowchart of yet another data mining processing method according to an embodiment of the present invention;
Fig. 4 is a schematic flowchart of still another data mining processing method according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a data mining processing device according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of an acquisition and mining module according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a calculation module according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a first probability calculation unit according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a determination module according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of another data mining processing device according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which is a schematic flowchart of a data mining processing method according to an embodiment of the present invention, the method may include:
S101: obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
Specifically, a server of the social network may obtain multiple pieces of user remark information corresponding to the user to be mined. The user to be mined is a user whose real name the server needs to identify, and the multiple pieces of user remark information are the remarks that the user's friends have attached to that user. For example, if the user to be mined has 100 friends and 75 of them have set a remark for the user, the remarks set by those 75 friends are the multiple pieces of user remark information. The server then mines at least one piece of candidate user remark information from them and takes each group of identical candidate user remark information as a candidate real name. For example, if 20 pieces of candidate user remark information are "Wang AB", 3 are "Huang AC", 15 are "Huang AB" and 30 are "Wang AC", then "Wang AB", "Huang AC", "Huang AB" and "Wang AC" are all taken as candidate real names.
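The grouping of identical candidate remarks can be illustrated with a minimal Python sketch. It assumes the candidate remarks have already been extracted into a list of strings, with the surnames in the example romanized as Wang and Huang; the function name `candidate_real_names` is hypothetical and is not taken from the patent.

```python
from collections import Counter

# Hypothetical input: candidate remarks already mined for one user,
# using the counts from the example in the text.
candidate_remarks = (
    ["Wang AB"] * 20 + ["Huang AC"] * 3 + ["Huang AB"] * 15 + ["Wang AC"] * 30
)

def candidate_real_names(remarks):
    """Group identical remarks; each distinct remark is one candidate real name."""
    return Counter(remarks)

counts = candidate_real_names(candidate_remarks)
for name, count in counts.most_common():
    print(name, count)
```

Each key of `counts` is one candidate real name, and its value is the occurrence count used in the later probability step.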
The server may mine the at least one piece of candidate user remark information as follows: obtain the multiple pieces of user remark information corresponding to the user to be mined, and screen out, according to a name structure rule and a preset surname matching table, first-class user remark information that meets the surname condition; then delete from the first-class user remark information any remark that contains a proper noun and/or a high-frequency word, and take the first-class user remark information remaining after the deletion as the at least one piece of candidate user remark information. Proper nouns here include role words such as teacher, master, Mr. and Miss, and high-frequency words include frequently occurring words such as tomorrow, the day after tomorrow, eat and drink water. For example, if a piece of first-class user remark information is "Teacher Wang", it is determined to contain a proper noun and is therefore deleted.
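As a rough sketch of this deletion step, the following Python function drops remarks that contain a role word or a high-frequency word. The word lists and the whitespace tokenization are illustrative assumptions; real remarks would be Chinese text checked against the system's own curated lists.

```python
# Illustrative word lists (assumptions, not taken from the patent).
ROLE_WORDS = ("teacher", "master", "mr", "miss")
HIGH_FREQUENCY_WORDS = ("tomorrow", "eat", "drink")

def drop_non_names(remarks):
    """Keep only remarks that contain no role word and no high-frequency word."""
    blocked = ROLE_WORDS + HIGH_FREQUENCY_WORDS
    kept = []
    for remark in remarks:
        tokens = remark.lower().split()
        if any(token in blocked for token in tokens):
            continue  # the remark contains a blocked word, so it is deleted
        kept.append(remark)
    return kept

remaining = drop_non_names(["Teacher Wang", "Wang AB", "eat tomorrow"])
print(remaining)
```

Only the remarks that survive this filter go on to be treated as candidate user remark information.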
The name structure rule may refer to the number of characters in an ordinary name: an ordinary name is generally 2 to 4 Chinese characters (2 to 3 characters with a single-character surname, 3 to 4 characters with a compound surname). Accordingly, screening out the first-class user remark information that meets the surname condition according to the name structure rule and the preset surname matching table may proceed as follows. The server first segments the obtained pieces of user remark information with a word segmentation algorithm (for example, the remark "he is Wang AB" becomes "Wang AB" after segmentation), and then keeps the segmented remarks that contain 2 to 4 Chinese characters as preliminarily screened user remark information. Each preliminarily screened remark of 2 characters is matched against the single-character surname set in the preset surname matching table: if its first character is in the single-character surname set, the remark meets the surname condition and is taken as first-class user remark information; otherwise it is rejected. Each preliminarily screened remark of 4 characters is matched against the compound surname set in the preset surname matching table: if its first two characters are in the compound surname set, the remark meets the surname condition and is taken as first-class user remark information; otherwise it is rejected. Each preliminarily screened remark of 3 characters is matched against both sets: if its first character is in the single-character surname set or its first two characters are in the compound surname set, the remark meets the surname condition and is taken as first-class user remark information; if neither condition is met, it is rejected.
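The length-dependent surname check can be sketched in Python as follows. The sketch assumes word segmentation has already been applied, and the two surname sets are tiny illustrative samples standing in for the preset surname matching table; the function name is hypothetical.

```python
# Tiny illustrative surname sets (assumptions); a real surname matching
# table would be far larger.
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}
COMPOUND_SURNAMES = {"欧阳", "司马"}

def meets_surname_condition(remark):
    """Apply the 2/3/4-character surname rule to an already segmented remark."""
    length = len(remark)
    if length == 2:
        return remark[0] in SINGLE_SURNAMES
    if length == 3:
        return remark[0] in SINGLE_SURNAMES or remark[:2] in COMPOUND_SURNAMES
    if length == 4:
        return remark[:2] in COMPOUND_SURNAMES
    return False  # outside the 2-to-4-character range of an ordinary name

first_class = [r for r in ["王明", "欧阳小明", "明天见", "王"] if meets_surname_condition(r)]
print(first_class)
```

A 3-character remark is accepted if either surname test passes, which mirrors the "one of the two conditions" rule in the text.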
S102: count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;
Specifically, the server may obtain the full pinyin of each piece of candidate user remark information; the full pinyin consists of a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo". The server then counts, over the candidate user remark information, the occurrence count of each identical surname pinyin and of each identical given-name pinyin. For example, suppose the candidate user remark information includes 20 remarks written "Zhang Xiaobo", 25 remarks written with different characters that are also read "Zhang Xiaobo", 10 remarks read "Wang Xiaobo" and 5 remarks read "Zhang Haibo". The identical surname pinyins are then "zhang" and "wang", and the identical given-name pinyins are "xiaobo" and "haibo"; the occurrence count of "zhang" is 50, that of "wang" is 10, that of "xiaobo" is 55 and that of "haibo" is 5. The server then calculates the joint probability of each identical full pinyin from the occurrence counts of the surname pinyins, the occurrence counts of the given-name pinyins and the total amount of candidate user remark information, and calculates the posterior probability of each candidate real name from the occurrence count of the identical full pinyin with the largest joint probability and the occurrence count of each candidate real name.
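The counting in this example can be reproduced with a short Python sketch. The pinyin of each remark is supplied by hand rather than computed, the row labels are hypothetical, and the 10 Wang remarks are assumed to carry the given-name pinyin "xiaobo", which is what the stated total of 55 for "xiaobo" implies.

```python
from collections import Counter

# (label, surname pinyin, given-name pinyin, occurrence count)
# Labels are hypothetical; "spelling 1" and "spelling 2" stand for two
# different written forms that share the same pinyin.
rows = [
    ("Zhang Xiaobo, spelling 1", "zhang", "xiaobo", 20),
    ("Zhang Xiaobo, spelling 2", "zhang", "xiaobo", 25),
    ("Wang Xiaobo", "wang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 5),
]

def pinyin_counts(rows):
    """Sum occurrence counts per surname pinyin and per given-name pinyin."""
    surname_counts, given_counts = Counter(), Counter()
    for _label, surname, given, count in rows:
        surname_counts[surname] += count
        given_counts[given] += count
    return surname_counts, given_counts

surname_counts, given_counts = pinyin_counts(rows)
print(dict(surname_counts), dict(given_counts))
```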
The joint probability of each identical full pinyin is calculated from the occurrence counts of the surname pinyins, the occurrence counts of the given-name pinyins and the total amount of candidate user remark information as follows: calculate the first probability of each identical surname pinyin from its occurrence count and the total amount of candidate user remark information; calculate the second probability of each identical given-name pinyin from its occurrence count and the total amount of candidate user remark information; and combine each first probability with each second probability to obtain the joint probability of each identical full pinyin.
The joint probability is calculated as $P_{\text{full pinyin}} = P_{\text{surname pinyin}} \times P_{\text{given-name pinyin}}$, where $P_{\text{surname pinyin}}$ is the first probability and $P_{\text{given-name pinyin}}$ is the second probability. The posterior probability is calculated as $P(\text{candidate real name} \mid \text{optimal full pinyin}) = \dfrac{\text{occurrence count of the candidate real name within the optimal full pinyin}}{\text{occurrence count of the optimal full pinyin}}$, where the optimal full pinyin is the identical full pinyin with the largest joint probability. If the full pinyin of a candidate real name is not the optimal full pinyin, the occurrence count of that candidate real name within the optimal full pinyin is 0.
For example, suppose the candidate user remark information includes three differently written names that are all read "Wu Xiaobo", occurring 30, 20 and 10 times respectively, together with 10 occurrences of "Zhang Xiaobo" and 30 occurrences of "Zhang Haibo", for a total of 100. The identical full pinyins are then "wu xiaobo", "zhang xiaobo" and "zhang haibo". For the surname pinyins, $P_{\text{surname pinyin}}(\text{wu}) = 60/100$ and $P_{\text{surname pinyin}}(\text{zhang}) = 40/100$; for the given-name pinyins, $P_{\text{given-name pinyin}}(\text{xiaobo}) = 70/100$ and $P_{\text{given-name pinyin}}(\text{haibo}) = 30/100$. The joint probabilities are therefore $P_{\text{full pinyin}}(\text{wu xiaobo}) = 60/100 \times 70/100 = 42/100$, $P_{\text{full pinyin}}(\text{zhang xiaobo}) = 40/100 \times 70/100 = 28/100$ and $P_{\text{full pinyin}}(\text{zhang haibo}) = 40/100 \times 30/100 = 12/100$. The joint probability of "wu xiaobo" is the largest, so "wu xiaobo" is taken as the optimal full pinyin. The posterior probabilities of the three written forms of "Wu Xiaobo" given the optimal full pinyin "wu xiaobo" are then 30/60, 20/60 and 10/60 respectively, and the posterior probabilities of "Zhang Xiaobo" and "Zhang Haibo" given the optimal full pinyin "wu xiaobo" are both 0.
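The worked example can be checked with a small Python sketch that uses exact fractions. It is only an illustration under stated assumptions: the pinyin of each candidate is entered by hand, the labels "spelling 1" to "spelling 3" are hypothetical stand-ins for the three written forms read "Wu Xiaobo", and the function name is not from the patent.

```python
from collections import Counter
from fractions import Fraction

# (candidate real name label, surname pinyin, given-name pinyin, count)
rows = [
    ("Wu Xiaobo (spelling 1)", "wu", "xiaobo", 30),
    ("Wu Xiaobo (spelling 2)", "wu", "xiaobo", 20),
    ("Wu Xiaobo (spelling 3)", "wu", "xiaobo", 10),
    ("Zhang Xiaobo", "zhang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 30),
]

def joint_and_posterior(rows):
    """Return joint probabilities, the optimal full pinyin, and posteriors."""
    total = sum(row[3] for row in rows)
    surname_counts, given_counts, full_counts = Counter(), Counter(), Counter()
    for _name, surname, given, count in rows:
        surname_counts[surname] += count
        given_counts[given] += count
        full_counts[(surname, given)] += count
    # Joint probability = first probability * second probability.
    joint = {
        (s, g): Fraction(surname_counts[s], total) * Fraction(given_counts[g], total)
        for (s, g) in full_counts
    }
    best = max(joint, key=joint.get)
    # A candidate whose full pinyin is not the optimal one counts as 0.
    posterior = {
        name: Fraction(count if (s, g) == best else 0, full_counts[best])
        for name, s, g, count in rows
    }
    return joint, best, posterior

joint, best, posterior = joint_and_posterior(rows)
optimal_name = max(posterior, key=posterior.get)
print(best, optimal_name, posterior[optimal_name])
```

Using `Fraction` keeps the values exact, so 42/100 and 30/60 appear in reduced form as 21/50 and 1/2.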
S103: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;
Specifically, after calculating the posterior probability of each candidate real name, the server may take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined, that is, determine the optimal real name to be the user's true real name, so that the user's real name is accurately identified. For example, suppose the candidate real names are three differently written names read "Wu Xiaobo", with posterior probabilities 30/60, 20/60 and 10/60, together with "Zhang Xiaobo" and "Zhang Haibo", each with posterior probability 0. The written form of "Wu Xiaobo" whose posterior probability is 30/60, the largest, is then determined to be the optimal real name of the user to be mined.
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and each group of identical candidate user remark information is taken as a candidate real name. The occurrence count of each identical pinyin is then counted from the pinyin of each piece of candidate user remark information, the posterior probability of each candidate real name is calculated from those pinyin occurrence counts and the occurrence count of each candidate real name, and the candidate real name with the largest posterior probability is finally taken as the optimal real name of the user to be mined. The user's real name can thus be accurately inferred from user remark information even when the user has not provided it, and the functions of the social network can be enriched on the basis of the inferred real name.
Referring to Fig. 2, which is a schematic flowchart of another data mining processing method according to an embodiment of the present invention, the method may include:
S201: obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take each group of identical candidate user remark information among the at least one piece of candidate user remark information as a candidate real name;
S202: count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and calculate the posterior probability of each candidate real name according to the occurrence count of each identical pinyin and the occurrence count of each candidate real name;
For the specific implementation of steps S201 to S202, refer to S101 to S102 in the embodiment corresponding to Fig. 1; details are not repeated here.
S203: determine whether the largest posterior probability is greater than a preset probability threshold;
Specifically, after calculating the posterior probability of each candidate real name, the server may further determine whether the largest posterior probability is greater than the preset probability threshold.
S204: take the candidate real name with the largest posterior probability as the optimal real name of the user to be mined;
Specifically, if the result of S203 is yes, the largest posterior probability is sufficiently credible, so the candidate real name with the largest posterior probability can be taken as the optimal real name of the user to be mined, which helps ensure that the optimal real name is the user's true real name.
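The threshold check of S203 and S204 can be sketched in Python as below. The threshold values are arbitrary illustrative numbers, since the text gives no concrete value, and the function name is hypothetical. Returning `None` is simply this sketch's way of signalling that the correction step S205 would be needed; the input is assumed to be non-empty.

```python
def pick_if_confident(posteriors, threshold):
    """Return the top candidate if its posterior exceeds the threshold, else None."""
    best_name = max(posteriors, key=posteriors.get)
    if posteriors[best_name] > threshold:
        return best_name  # S204: credible enough to accept directly
    return None  # the caller would fall back to the correction step (S205)

# Illustrative posteriors and thresholds (assumptions).
accepted = pick_if_confident({"A": 0.5, "B": 0.3, "C": 0.2}, threshold=0.4)
deferred = pick_if_confident({"A": 0.5, "B": 0.3, "C": 0.2}, threshold=0.6)
print(accepted, deferred)
```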
S205, repaiies the posterior probability that described each candidate's real name is corresponding respectively according to default power rule of adjusting
Just, and using candidate's real name of maximum revised posterior probability as the optimum real name of described user to be excavated;
Specifically, if the determination in S203 is NO, the server can correct the posterior probability corresponding to each candidate real name according to a preset weight adjustment rule, and take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined. The weight adjustment rule includes at least one of the following mapping relations: the mapping relation between the occurrence count of a candidate real name and a correction parameter; the mapping relation between the weight of an identical full pinyin and a correction parameter; the mapping relation between the character complexity of a candidate real name and a correction parameter; the mapping relation between the character length of a candidate real name and a correction parameter; and the mapping relation between the popularity of a surname and a correction parameter.
The mapping relation between the occurrence count of a candidate real name and a correction parameter maps multiple different occurrence-count ranges to multiple different correction parameters. A higher occurrence-count range corresponds to a larger correction parameter, and a range below a count threshold corresponds to a negative correction parameter. For example, if candidate real name A occurs more often than candidate real name B, the correction parameter of candidate real name A is larger, i.e. the posterior probability of candidate real name A is increased by a larger amount. As another example, if the occurrence count of candidate real name C is below the count threshold, the posterior probability of candidate real name C is reduced.
The mapping relation between the weight of an identical full pinyin and a correction parameter maps multiple different weight ranges to multiple different correction parameters. A higher weight range corresponds to a larger correction parameter, and a weight range below a weight threshold may correspond to a negative correction parameter. For example, the larger the share of the total number of pieces of user remark information that an identical full pinyin accounts for, the greater the weight of that full pinyin and the larger its correction parameter, i.e. the posterior probabilities of the multiple candidate real names sharing that full pinyin are raised.
The mapping relation between the character complexity of a candidate real name and a correction parameter maps multiple different character complexities to multiple different correction parameters. A higher character complexity corresponds to a larger correction parameter. For example, if a candidate real name contains a Chinese character that is hard to write and uncommon (i.e. has a high character complexity), the candidate real name corresponds to a larger correction parameter, so its posterior probability can be raised considerably.
The mapping relation between the character length of a candidate real name and a correction parameter maps multiple different character lengths to multiple different correction parameters. A longer character length corresponds to a larger correction parameter. For example, if candidate real name A is longer than candidate real name B, candidate real name A corresponds to a larger correction parameter, i.e. its posterior probability is raised by a larger amount.
The mapping relation between the popularity of a surname and a correction parameter maps multiple different surname popularity levels to multiple different correction parameters. A more common surname corresponds to a larger correction parameter, and a surname below a popularity threshold may correspond to a negative correction parameter. For example, the correction parameter for the surname "Wang" is larger than that for the surname "Ouyang".
The server can therefore correct the posterior probability corresponding to each candidate real name according to one mapping relation in the weight adjustment rule or a combination of several of them (the correction may increase or decrease a posterior probability), and take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined.
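The correction step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the correction parameter is added to the posterior probability, implements only the occurrence-count mapping, and uses invented range boundaries and parameter values (`FREQ_BOUNDS`, `FREQ_PARAMS`). Other mapping relations would be passed in as additional rule functions.

```python
from bisect import bisect_right

# Hypothetical occurrence-count ranges and their correction parameters.
# Counts below 5 (the assumed count threshold) get a negative parameter.
FREQ_BOUNDS = [5, 20, 50]
FREQ_PARAMS = [-0.05, 0.0, 0.03, 0.06]


def frequency_rule(name, count):
    """Return the correction parameter of the range that contains `count`."""
    return FREQ_PARAMS[bisect_right(FREQ_BOUNDS, count)]


def adjust_posteriors(posteriors, counts, rules=(frequency_rule,)):
    """Add the correction parameters of all rules to each posterior (assumed additive)."""
    return {
        name: p + sum(rule(name, counts[name]) for rule in rules)
        for name, p in posteriors.items()
    }


def pick_optimal(posteriors, counts, rules=(frequency_rule,)):
    """Return the candidate with the maximum corrected posterior, plus all corrected values."""
    adjusted = adjust_posteriors(posteriors, counts, rules)
    best = max(adjusted, key=adjusted.get)
    return best, adjusted
```

With posteriors of 0.40 for a candidate seen 30 times and 0.42 for a candidate seen 3 times, the sketch raises the first and lowers the second, so the more frequently remarked candidate is selected.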
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical pieces of candidate user remark information are each taken as a candidate real name. According to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin is counted, and the posterior probability corresponding to each candidate real name is calculated from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names. When the maximum posterior probability is greater than a preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined. The real name of a user can thus be inferred accurately from user remark information even when the user has not provided it, and the functions of the social network can be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the posterior probability corresponding to each candidate real name can be further corrected according to the preset weight adjustment rule, and the candidate real name with the maximum corrected posterior probability is taken as the optimal real name of the user to be mined, which can further improve the accuracy of real-name identification.
Referring to Fig. 3, which is a schematic flowchart of another data mining processing method provided by an embodiment of the present invention, the method may include:
S301: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information each as a candidate real name;
S302: according to the pinyin corresponding to each piece of candidate user remark information, count the occurrence count of each identical pinyin, and calculate the posterior probability corresponding to each candidate real name according to the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
For the specific implementation of steps S301 to S302, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.
S303: determine whether the maximum posterior probability is greater than a preset probability threshold;
Specifically, after calculating the posterior probability corresponding to each candidate real name, the server can further determine whether the maximum posterior probability is greater than the preset probability threshold.
S304: take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
Specifically, if the determination in S303 is YES, the maximum posterior probability is sufficiently credible. The candidate real name with the maximum posterior probability can therefore be taken as the optimal real name of the user to be mined, which helps ensure that the optimal real name is the actual real name of the user to be mined.
S305: calculate the ranking weight value corresponding to each candidate real name according to the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined;
Specifically, if the determination in S303 is NO, the server can obtain the remark attributes of the user corresponding to each piece of candidate user remark information (i.e. the user who wrote that remark for the user to be mined). A user's remark attributes include the number of remarks in which that user recorded a friend's real name and the total number of remarks that user has written for friends. From these remark attributes the server calculates the user remark real-name habit value corresponding to each piece of candidate user remark information, where the user remark real-name habit value is the ratio of the number of the user's real-name remarks to the total number of remarks the user has written for friends. For example, suppose the user corresponding to a piece of candidate user remark information (i.e. the user who wrote the remark for the user to be mined) is user A. If user A has written 100 remarks for other people and 70 of those 100 remarks are actual real names, the user remark real-name habit value of user A is 70/100. After calculating the user remark real-name habit value corresponding to each piece of candidate user remark information, the server can calculate the ranking weight value corresponding to each candidate real name from these habit values and the posterior probability corresponding to each candidate real name, and take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined.
The ranking weight value corresponding to each candidate real name can be calculated from the user remark real-name habit values and the posterior probabilities as follows. Taking candidate real name A as an example, the server can determine the multiple pieces of candidate user remark information corresponding to candidate real name A (the content of each of these pieces is candidate real name A) as multiple pieces of target candidate user remark information, and then calculate the average of the user remark real-name habit values corresponding to these pieces of target candidate user remark information. The average is then added to the posterior probability corresponding to candidate real name A to obtain the corresponding ranking weight value; alternatively, the average can be multiplied by a coefficient before being added to the posterior probability corresponding to candidate real name A. The ranking weight values of the other candidate real names are calculated on the same principle.
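The calculation above can be sketched in a few lines. This is an illustrative sketch only: the function names and the `coefficient` parameter are assumptions, and the coefficient variant is read here as "scale the average habit value, then add the posterior probability".

```python
from statistics import mean


def real_name_habit(real_name_remarks, total_remarks):
    """Share of a remarking user's remarks that are real names (e.g. 70 / 100)."""
    return real_name_remarks / total_remarks if total_remarks else 0.0


def ranking_weight(habit_values, posterior, coefficient=1.0):
    """Average habit value of the users who wrote this candidate name, scaled, plus the posterior."""
    return coefficient * mean(habit_values) + posterior
```

For a candidate real name written by two users whose habit values are 0.7 and 0.5, with a posterior probability of 0.4, the default coefficient gives a ranking weight value of about 1.0, and a coefficient of 0.5 gives about 0.7.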
Optionally, if the maximum corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the ranking weight values corresponding to the corrected posterior probabilities can be calculated on the principle of S305, so as to determine the optimal real name more accurately.
Optionally, if the maximum ranking weight value calculated in S305 is still less than the preset probability threshold, the ranking weight values can be corrected on the principle of S205 in the embodiment corresponding to Fig. 2 above, so as to determine the optimal real name more accurately.
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical pieces of candidate user remark information are each taken as a candidate real name. According to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin is counted, and the posterior probability corresponding to each candidate real name is calculated from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names. When the maximum posterior probability is greater than a preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined. The real name of a user can thus be inferred accurately from user remark information even when the user has not provided it, and the functions of the social network can be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the ranking weight value corresponding to each candidate real name can further be calculated from the user remark real-name habit value corresponding to each piece of candidate user remark information and the posterior probability corresponding to each candidate real name, and the candidate real name with the maximum ranking weight value is taken as the optimal real name of the user to be mined, which can further improve the accuracy of real-name identification.
Referring to Fig. 4, which is a schematic flowchart of yet another data mining processing method provided by an embodiment of the present invention, the method may include:
S401: obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information each as a candidate real name;
S402: according to the pinyin corresponding to each piece of candidate user remark information, count the occurrence count of each identical pinyin, and calculate the posterior probability corresponding to each candidate real name according to the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
For the specific implementation of steps S401 to S402, refer to S101 to S102 in the embodiment corresponding to Fig. 1 above; details are not repeated here.
S403: determine whether the maximum posterior probability is greater than a preset probability threshold;
Specifically, after calculating the posterior probability corresponding to each candidate real name, the server can further determine whether the maximum posterior probability is greater than the preset probability threshold.
S404: take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
Specifically, if the determination in S403 is YES, the maximum posterior probability is sufficiently credible. The candidate real name with the maximum posterior probability can therefore be taken as the optimal real name of the user to be mined, which helps ensure that the optimal real name is the actual real name of the user to be mined.
S405: select the pieces of candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, extract features from the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model, and take the candidate real name with the higher score as the optimal real name of the user to be mined;
Specifically, if the determination in S403 is NO, the server can select the pieces of candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, extract features from the selected candidate user remark information, score these two candidate real names according to the extracted features and a preset rank model, and take the candidate real name with the higher score as the optimal real name of the user to be mined. The rank model can be a pairwise rank model. The features can include: the total character length of the user remark information before word segmentation that corresponds to the candidate user remark information, the character length before the name, the character length after the name, the total character length of the candidate user remark information, the user remark real-name habit value of the user to be mined, and the user remark real-name habit value of the user corresponding to the candidate user remark information (i.e. the user who wrote the remark for the user to be mined).
Before the rank model is used for scoring, it needs to be built and trained. The process can be as follows. Obtain multiple pieces of training user remark information that correspond to a user whose real name is known and that are used to train the rank model, and take identical pieces of training user remark information among them each as a training candidate real name. Take each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, which corresponds to a first score value. Take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the same full pinyin as the user's real name as a second support set, which corresponds to a second score value. The first score value is greater than the second score value. Extract the features of the first support set and the features of the second support set, and build and train the rank model from the features of the first support set with the first score value and the features of the second support set with the second score value. Scoring the candidate real names with the largest and second-largest posterior probabilities with the rank model can therefore proceed as follows: according to the score value of the support set (the first support set or the second support set) to which the pieces of candidate user remark information corresponding to each of the two input candidate real names belong, calculate the final score corresponding to each of the two candidate real names.
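The text does not specify how the pairwise rank model is learned. As one possible illustration, the sketch below trains a linear scoring function with a perceptron-style update on the feature difference of each (first support set, second support set) pair; the learner, the learning rate, the number of epochs, and the example feature layout are all assumptions rather than details of the described method.

```python
def dot(w, x):
    """Linear model score of one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise(first_set, second_set, epochs=20, lr=0.1):
    """Learn weights so that first-support-set rows score above second-support-set rows."""
    w = [0.0] * len(first_set[0])
    for _ in range(epochs):
        for pos in first_set:
            for neg in second_set:
                diff = [p - n for p, n in zip(pos, neg)]
                if dot(w, diff) <= 0:  # pair is tied or ordered wrongly
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score(w, feature_rows):
    """Final score of a candidate real name: mean model score over its supporting remarks."""
    return sum(dot(w, row) for row in feature_rows) / len(feature_rows)
```

Each row here would be a feature vector for one remark, for example (length before segmentation, characters before the name, characters after the name, habit value of the remarking user). At prediction time, `score` is computed for the two candidate real names and the higher one is chosen.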
Optionally, if the maximum corrected posterior probability calculated in S205 of the embodiment corresponding to Fig. 2 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest corrected posterior probabilities.
Optionally, if the maximum ranking weight value calculated in S305 of the embodiment corresponding to Fig. 3 above is still less than the preset probability threshold, the optimal real name can be selected, based on the rank model, from the candidate real names corresponding to the largest and second-largest ranking weight values.
In this embodiment of the present invention, at least one piece of candidate user remark information is mined from multiple pieces of user remark information, and identical pieces of candidate user remark information are each taken as a candidate real name. According to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin is counted, and the posterior probability corresponding to each candidate real name is calculated from the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names. When the maximum posterior probability is greater than a preset probability threshold, the candidate real name with the maximum posterior probability can be taken as the optimal real name of the user to be mined. The real name of a user can thus be inferred accurately from user remark information even when the user has not provided it, and the functions of the social network can be enriched based on the inferred real name. When the maximum posterior probability is less than or equal to the preset probability threshold, the optimal real name can further be selected, based on the rank model, from the candidate real names with the largest and second-largest posterior probabilities, which can further improve the accuracy of real-name identification.
Referring to Fig. 5, which is a schematic structural diagram of a data mining processing device provided by an embodiment of the present invention, the data mining processing device 1 can be applied in a server of a social network, and may include: an acquisition and mining module 10, a calculation module 20, and a determination module 30;
The acquisition and mining module 10 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information each as a candidate real name;
Specifically, the acquisition and mining module 10 can obtain multiple pieces of user remark information corresponding to the user to be mined, where the user to be mined is a user whose actual real name the server needs to identify, and the multiple pieces of user remark information are the remarks that friend users have written for the user to be mined. For example, if the user to be mined has 100 friend users and 75 of them have written a remark for the user to be mined, the remarks written by these 75 friends can be taken as the multiple pieces of user remark information. The acquisition and mining module 10 further mines at least one piece of candidate user remark information from the multiple pieces of user remark information, and takes identical pieces of candidate user remark information each as a candidate real name. For example, if the at least one piece of candidate user remark information contains 20 pieces reading "Wang AB", 3 pieces reading "Huang AC", 15 pieces reading "Huang AB", and 30 pieces reading "Wang AC", then "Wang AB", "Huang AC", "Huang AB", and "Wang AC" can all be taken as candidate real names.
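Grouping identical pieces of candidate user remark information into candidate real names amounts to counting duplicates. A minimal sketch, with a hypothetical function name and the counts from the example above:

```python
from collections import Counter


def candidate_real_names(candidate_remarks):
    """Map each distinct candidate remark to the number of times it occurs."""
    return Counter(candidate_remarks)


remarks = (
    ["Wang AB"] * 20 + ["Huang AC"] * 3 + ["Huang AB"] * 15 + ["Wang AC"] * 30
)
names = candidate_real_names(remarks)
```

The keys of `names` are the candidate real names, and the values are their occurrence counts, which the later probability calculation uses.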
Further, referring also to Fig. 6, which is a schematic structural diagram of an acquisition and mining module 10 provided by an embodiment of the present invention, the acquisition and mining module 10 may include: an acquisition and screening unit 101 and a deletion and determination unit 102;
The acquisition and screening unit 101 is configured to obtain multiple pieces of user remark information corresponding to the user to be mined, and to screen out, from the multiple pieces of user remark information, first-class user remark information that meets the surname condition according to a name structure rule and a preset surname matching table;
Specifically, the name structure rule can refer to the number of characters in a normal name: a normal name usually has 2 to 4 Chinese characters (2 to 3 characters for a name with a single-character surname, and 3 to 4 characters for a name with a compound two-character surname). The acquisition and screening unit 101 can therefore first segment the obtained pieces of user remark information with an effective word segmentation algorithm (for example, the user remark information "he is Wang AB" becomes "Wang AB" after segmentation), and then screen out the segmented pieces of user remark information that contain 2 to 4 Chinese characters to obtain preliminarily screened user remark information.
Next, the unit matches the preliminarily screened user remark information containing 2 characters against the single-character surname set in the preset surname matching table, to check whether its first Chinese character is in the single-character surname set. If it is, the 2-character remark meets the surname condition and is taken as first-class user remark information; otherwise it is rejected.
The acquisition and screening unit 101 likewise matches the preliminarily screened user remark information containing 4 characters against the compound surname set in the preset surname matching table, to check whether its first two Chinese characters are in the compound surname set. If they are, the 4-character remark meets the surname condition and is taken as first-class user remark information; otherwise it is rejected.
The acquisition and screening unit 101 also matches the preliminarily screened user remark information containing 3 characters against both the single-character surname set and the compound surname set, to check whether its first Chinese character is in the single-character surname set or its first two Chinese characters are in the compound surname set. If either condition is met, the 3-character remark meets the surname condition and is taken as first-class user remark information; if neither is met, it is rejected.
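The surname screening described above can be sketched as follows. The sketch assumes the remarks have already been word-segmented and consist of Chinese characters only; the two surname sets are small hypothetical excerpts of a surname matching table, and the function names are illustrative.

```python
# Hypothetical excerpts of the preset surname matching table.
SINGLE_SURNAMES = {"王", "黄", "张", "吴"}
COMPOUND_SURNAMES = {"欧阳", "司马"}


def meets_surname_condition(remark):
    """Apply the 2-, 3- and 4-character surname checks to a segmented remark."""
    n = len(remark)
    if n == 2:
        return remark[0] in SINGLE_SURNAMES
    if n == 3:
        return remark[0] in SINGLE_SURNAMES or remark[:2] in COMPOUND_SURNAMES
    if n == 4:
        return remark[:2] in COMPOUND_SURNAMES
    return False  # outside the 2-to-4 character name structure rule


def first_class_remarks(segmented_remarks):
    """Keep only the remarks that meet the surname condition."""
    return [r for r in segmented_remarks if meets_surname_condition(r)]
```

For instance, a 4-character remark that starts with a single-character surname but not with a compound surname is rejected, because only the compound surname set is consulted for 4-character remarks.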
The deletion and determination unit 102 is configured to delete, from the first-class user remark information, the user remark information that contains proper nouns and/or high-frequency words, determine the first-class user remark information remaining after deletion as the at least one piece of candidate user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information each as a candidate real name;
The proper nouns can include role-specific words such as "teacher", "master worker", "sir", and "miss", and the high-frequency words can include frequently occurring words such as "tomorrow", "the day after tomorrow", "have a meal", and "drink water". For example, if a piece of first-class user remark information is "Teacher Wang", the deletion and determination unit 102 can determine that it contains a proper noun and can therefore delete it.
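A minimal sketch of this deletion step is shown below. The word lists are hypothetical Chinese renderings of the examples in the text (teacher, master worker, sir, miss; tomorrow, the day after tomorrow, have a meal, drink water), and simple substring matching is an assumption about how "contains" is evaluated.

```python
# Hypothetical word lists; a real deployment would load much larger ones.
ROLE_WORDS = {"老师", "师傅", "先生", "小姐"}
HIGH_FREQUENCY_WORDS = {"明天", "后天", "吃饭", "喝水"}


def candidate_remarks(first_class):
    """Drop first-class remarks that contain a role word or a high-frequency word."""
    blocked = ROLE_WORDS | HIGH_FREQUENCY_WORDS
    return [r for r in first_class if not any(word in r for word in blocked)]
```

A remark such as "王老师" ("Teacher Wang") passes the surname check but is removed here, while ordinary names are kept as candidate user remark information.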
The calculation module 20 is configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability corresponding to each candidate real name according to the occurrence counts of the identical pinyin and the occurrence counts of the candidate real names;
Specifically, referring also to Fig. 7, which is a schematic structural diagram of a calculation module 20 provided by an embodiment of the present invention, the calculation module 20 may include: a pinyin acquisition unit 201, a count statistics unit 202, a first probability calculation unit 203, and a second probability calculation unit 204;
The pinyin acquisition unit 201 is configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin;
Specifically, the pinyin acquisition unit 201 can obtain the full pinyin corresponding to each piece of candidate user remark information among the at least one piece of candidate user remark information, the full pinyin including a surname pinyin and a given-name pinyin. For example, if a piece of candidate user remark information is "Zhang Xiaobo", the corresponding full pinyin is "zhang xiaobo", where the surname pinyin is "zhang" and the given-name pinyin is "xiaobo".
The count statistics unit 202 is configured to count, according to each piece of candidate user remark information, the occurrence count of each identical surname pinyin and the occurrence count of each identical given-name pinyin;
The first probability calculation unit 203 is configured to calculate the joint probability corresponding to each identical full pinyin according to the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin, and the total number of pieces of candidate user remark information;
The second probability calculation unit 204 is configured to calculate the posterior probability corresponding to each candidate real name according to the occurrence count of the identical full pinyin with the maximum joint probability and the occurrence count of each candidate real name;
The posterior probability is calculated as: P(candidate real name | best full pinyin) = (occurrence count of the candidate real name under the best full pinyin) / (occurrence count of the best full pinyin), where the best full pinyin is the identical full pinyin with the maximum joint probability. If the full pinyin of a candidate real name is not the best full pinyin, the occurrence count of that candidate real name under the best full pinyin is 0.
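The posterior formula can be illustrated with a short sketch. The function and parameter names are assumptions, the best full pinyin is taken as already known, and the keys "Wu Xiaobo #1" to "#3" are placeholders for three differently written names that share the full pinyin "wu xiaobo".

```python
def posterior_probabilities(name_counts, name_pinyin, best_pinyin):
    """P(name | best full pinyin) = count(name under best pinyin) / count(best pinyin)."""
    best_total = sum(
        count for name, count in name_counts.items()
        if name_pinyin[name] == best_pinyin
    )
    if best_total == 0:
        raise ValueError("no candidate real name has the best full pinyin")
    return {
        name: (count / best_total if name_pinyin[name] == best_pinyin else 0.0)
        for name, count in name_counts.items()
    }


counts = {"Wu Xiaobo #1": 30, "Wu Xiaobo #2": 20, "Wu Xiaobo #3": 10,
          "Zhang Xiaobo": 10, "Zhang Haibo": 30}
pinyin = {"Wu Xiaobo #1": "wu xiaobo", "Wu Xiaobo #2": "wu xiaobo",
          "Wu Xiaobo #3": "wu xiaobo", "Zhang Xiaobo": "zhang xiaobo",
          "Zhang Haibo": "zhang haibo"}
posteriors = posterior_probabilities(counts, pinyin, "wu xiaobo")
```

Candidate real names whose full pinyin differs from the best full pinyin receive a posterior probability of 0, matching the rule stated above.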
Further, referring also to Fig. 8, which is a schematic structural diagram of a first probability
calculation unit 203 provided by an embodiment of the present invention, the first probability
calculation unit 203 may include: a first probability calculation subunit 2031, a second
probability calculation subunit 2032, and a joint probability calculation subunit 2033;
The first probability calculation subunit 2031 is configured to calculate the first probability
corresponding to each identical surname pinyin according to the occurrence count of each identical
surname pinyin and the total number of candidate user remark information;
The second probability calculation subunit 2032 is configured to calculate the second probability
corresponding to each identical given-name pinyin according to the occurrence count of each
identical given-name pinyin and the total number of candidate user remark information;
The joint probability calculation subunit 2033 is configured to combine each first probability
with each second probability to obtain the joint probability corresponding to each identical
full pinyin;
Wherein, the joint probability is calculated as:
$P_{\text{full}} = P_{\text{surname}} \times P_{\text{name}}$,
where $P_{\text{full}}$ is the joint probability of a full pinyin, $P_{\text{surname}}$ is the
first probability (of the surname pinyin) and $P_{\text{name}}$ is the second probability (of the
given-name pinyin).
For example, suppose the at least one candidate user remark information includes 30 occurrences of
"Wu Xiaobo", 20 of "Wu little Bo", 10 of "Wu Xiaobo" (a written form different from the first),
10 of "Zhang Xiaobo" and 30 of "Zhang Haibo", where the first three are different written forms
that share the same full pinyin. The identical full pinyins are then "wu xiaobo", "zhang xiaobo"
and "zhang haibo".
The first probability calculation subunit 2031 may calculate, for the identical surname pinyin
"wu", $P_{\text{surname}}$ = occurrence count of "wu" / total number of candidate user remark
information = 60/100, and, for the identical surname pinyin "zhang", $P_{\text{surname}}$ =
occurrence count of "zhang" / total number of candidate user remark information = 40/100.
The second probability calculation subunit 2032 may calculate, for the identical given-name
pinyin "xiaobo", $P_{\text{name}}$ = occurrence count of "xiaobo" / total number of candidate
user remark information = 70/100, and, for the identical given-name pinyin "haibo",
$P_{\text{name}}$ = occurrence count of "haibo" / total number of candidate user remark
information = 30/100.
The joint probability calculation subunit 2033 may then calculate:
for the identical full pinyin "wu xiaobo", $P_{\text{full}}$ = $P_{\text{surname}}$ of "wu" ×
$P_{\text{name}}$ of "xiaobo" = 42/100;
for the identical full pinyin "zhang xiaobo", $P_{\text{full}}$ = $P_{\text{surname}}$ of
"zhang" × $P_{\text{name}}$ of "xiaobo" = 28/100;
for the identical full pinyin "zhang haibo", $P_{\text{full}}$ = $P_{\text{surname}}$ of
"zhang" × $P_{\text{name}}$ of "haibo" = 12/100.
It can be seen that the identical full pinyin "wu xiaobo" has the maximum joint probability, so
"wu xiaobo" is taken as the optimal full pinyin. The second probability calculation unit 204 may
further calculate the posterior probabilities:
P("Wu Xiaobo" | optimal full pinyin "wu xiaobo") = 30/60;
P("Wu little Bo" | optimal full pinyin "wu xiaobo") = 20/60;
P(the other "Wu Xiaobo" | optimal full pinyin "wu xiaobo") = 10/60;
P("Zhang Xiaobo" | optimal full pinyin "wu xiaobo") = 0;
P("Zhang Haibo" | optimal full pinyin "wu xiaobo") = 0.
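The worked example above can be reproduced with a short Python sketch. It is only an illustration of the described calculation, not the patented implementation: the function name `posterior_by_pinyin`, the tuple layout, and the labels "form A/B/C" (placeholders for the three written forms that share the pinyin "wu xiaobo") are assumptions introduced here.

```python
from collections import Counter
from fractions import Fraction

# (written form, surname pinyin, given-name pinyin, occurrence count).
# The "form A/B/C" labels are placeholders for three distinct written forms.
remarks = [
    ("Wu Xiaobo (form A)", "wu", "xiaobo", 30),
    ("Wu Xiaobo (form B)", "wu", "xiaobo", 20),
    ("Wu Xiaobo (form C)", "wu", "xiaobo", 10),
    ("Zhang Xiaobo", "zhang", "xiaobo", 10),
    ("Zhang Haibo", "zhang", "haibo", 30),
]


def posterior_by_pinyin(remarks):
    """Return (joint probabilities, optimal full pinyin, posteriors)."""
    total = sum(count for *_, count in remarks)
    surname_counts, name_counts, full_counts = Counter(), Counter(), Counter()
    for _, surname, given, count in remarks:
        surname_counts[surname] += count
        name_counts[given] += count
        full_counts[(surname, given)] += count

    # Joint probability of a full pinyin = P(surname pinyin) * P(given-name pinyin).
    joint = {
        (s, g): Fraction(surname_counts[s], total) * Fraction(name_counts[g], total)
        for (s, g) in full_counts
    }
    best = max(joint, key=joint.get)

    # Posterior = count of the candidate under the optimal full pinyin
    # divided by the count of the optimal full pinyin; 0 for other pinyins.
    posterior = {
        form: Fraction(count, full_counts[best]) if (s, g) == best else Fraction(0)
        for form, s, g, count in remarks
    }
    return joint, best, posterior


joint, best, posterior = posterior_by_pinyin(remarks)
best_name = max(posterior, key=posterior.get)
```

Exact fractions are used so that the intermediate values can be compared directly with the 42/100, 28/100 and 12/100 figures in the text.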
The determining module 30 is configured to take the candidate real name with the maximum posterior
probability as the optimal real name of the user to be mined;
Specifically, after the posterior probability corresponding to each candidate real name has been
calculated, the determining module 30 may take the candidate real name with the maximum posterior
probability as the optimal real name of the user to be mined, that is, the optimal real name may
be determined to be the true real name of the user to be mined, so that the user's real name can
be identified accurately. For example, the candidate real names include "Wu Xiaobo",
"Wu little Bo", the other "Wu Xiaobo", "Zhang Xiaobo" and "Zhang Haibo", with posterior
probabilities of 30/60, 20/60, 10/60, 0 and 0 respectively; the determining module 30 may then
determine the "Wu Xiaobo" with the maximum posterior probability (30/60) to be the optimal real
name of the user to be mined.
Further, referring also to Fig. 9, which is a schematic structural diagram of a determining module
30 provided by an embodiment of the present invention, the determining module 30 may include: a
first judging unit 301, a first determining unit 302, a correction determining unit 303, a second
judging unit 304, a second determining unit 305, a weight calculation determining unit 306, a
third judging unit 307, a third determining unit 308, and a model scoring determining unit 309;
The first judging unit 301 is configured to judge whether the maximum posterior probability is
greater than a preset probability threshold;
The first determining unit 302 is configured to, if the first judging unit 301 judges yes, take
the candidate real name with the maximum posterior probability as the optimal real name of the
user to be mined;
The correction determining unit 303 is configured to, if the first judging unit 301 judges no,
correct the posterior probability corresponding to each candidate real name according to a preset
weight adjustment rule, and take the candidate real name with the maximum corrected posterior
probability as the optimal real name of the user to be mined;
Wherein, the weight adjustment rule includes at least one of the following mapping relations: a
mapping between the occurrence count of a candidate real name and a correction parameter, a
mapping between the weight of an identical full pinyin and a correction parameter, a mapping
between the character complexity of a candidate real name and a correction parameter, a mapping
between the character length of a candidate real name and a correction parameter, and a mapping
between the popularity of a surname and a correction parameter. For the specific implementation
of the first judging unit 301, the first determining unit 302 and the correction determining unit
303, reference may be made to S201-S205 in the embodiment corresponding to Fig. 2, which is not
repeated here.
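The threshold test followed by a correction step can be sketched as below. This section does not state how a correction parameter is applied to a posterior probability, so the sketch assumes a simple multiplicative factor; the function name `pick_real_name` and the `correction` callable are hypothetical and stand in for whichever mapping relations the weight adjustment rule contains.

```python
def pick_real_name(posteriors, threshold, correction):
    """Pick the optimal real name from {candidate: posterior probability}.

    `correction(candidate)` returns a correction parameter; applying it as a
    multiplicative factor is an assumption made for this sketch only.
    """
    best = max(posteriors, key=posteriors.get)
    if posteriors[best] > threshold:
        # Judged yes: the maximum posterior is trusted as it is.
        return best
    # Judged no: correct every posterior, then take the maximum corrected value.
    corrected = {name: p * correction(name) for name, p in posteriors.items()}
    return max(corrected, key=corrected.get)


# Usage with made-up correction parameters.
posteriors = {"A": 0.5, "B": 1 / 3, "C": 1 / 6}
factors = {"A": 1.0, "B": 2.0, "C": 1.0}
choice = pick_real_name(posteriors, threshold=0.6, correction=factors.get)
```

With a threshold of 0.6 the maximum posterior (0.5) is not accepted directly, so the corrected values decide the outcome; with a lower threshold the correction step is skipped.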
The second judging unit 304 is configured to judge whether the maximum posterior probability is
greater than the preset probability threshold;
The second determining unit 305 is configured to, if the second judging unit 304 judges yes, take
the candidate real name with the maximum posterior probability as the optimal real name of the
user to be mined;
The weight calculation determining unit 306 is configured to, if the second judging unit 304
judges no, calculate the ranking weight value corresponding to each candidate real name according
to the user real-name remark habit value corresponding to each candidate user remark information
and the posterior probability corresponding to each candidate real name, and take the candidate
real name with the maximum ranking weight value as the optimal real name of the user to be mined;
Wherein, the user real-name remark habit value is the ratio of the number of user remark information
in which a user remarks friends by their real names to the number of all user remark information
with which that user remarks friends. For the specific implementation of the second judging unit
304, the second determining unit 305 and the weight calculation determining unit 306, reference
may be made to S301-S305 in the embodiment corresponding to Fig. 3, which is not repeated here.
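A minimal sketch of the habit value and of a ranking weight built from it follows. The habit value is the ratio defined above. How it is combined with the posterior probability is not given in this section, so the sketch assumes the ranking weight is the posterior multiplied by the mean habit value of the users who wrote that remark; the names `habit_value`, `ranking_weights` and the sample data are hypothetical.

```python
def habit_value(real_name_remarks, all_remarks):
    """Share of a user's friend remarks that are real names."""
    return real_name_remarks / all_remarks if all_remarks else 0.0


def ranking_weights(posteriors, remark_authors, habits):
    """Assumed combination: posterior * mean habit value of the remark's authors."""
    weights = {}
    for name, p in posteriors.items():
        authors = remark_authors.get(name, [])
        mean_habit = sum(habits[a] for a in authors) / len(authors) if authors else 0.0
        weights[name] = p * mean_habit
    return weights


# Usage: the lower-posterior candidate wins because its author usually
# remarks friends by real name (all values are made up).
habits = {"u1": habit_value(1, 5), "u2": habit_value(9, 10)}
weights = ranking_weights(
    {"A": 0.5, "B": 1 / 3},
    {"A": ["u1"], "B": ["u2"]},
    habits,
)
best = max(weights, key=weights.get)
```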
The third judging unit 307 is configured to judge whether the maximum posterior probability is
greater than the preset probability threshold;
The third determining unit 308 is configured to, if the third judging unit 307 judges yes, take
the candidate real name with the maximum posterior probability as the optimal real name of the
user to be mined;
The model scoring determining unit 309 is configured to, if the third judging unit 307 judges
no, select the candidate user remark information corresponding to the candidate real names with
the largest and second-largest posterior probabilities, perform feature extraction on the
selected candidate user remark information, score the candidate real names with the largest and
second-largest posterior probabilities according to the extracted features and a preset rank
model, and take the candidate real name with the higher score as the optimal real name of the
user to be mined;
Wherein, for the specific implementation of the third judging unit 307, the third determining
unit 308 and the model scoring determining unit 309, reference may be made to S401-S405 in the
embodiment corresponding to Fig. 4, which is not repeated here.
Optionally, when the first judging unit 301, the first determining unit 302 and the correction
determining unit 303 perform their operations, the second judging unit 304, the second
determining unit 305, the weight calculation determining unit 306, the third judging unit 307,
the third determining unit 308 and the model scoring determining unit 309 all stop working. When
the second judging unit 304, the second determining unit 305 and the weight calculation
determining unit 306 perform their operations, the first judging unit 301, the first determining
unit 302, the correction determining unit 303, the third judging unit 307, the third determining
unit 308 and the model scoring determining unit 309 all stop working. When the third judging unit
307, the third determining unit 308 and the model scoring determining unit 309 perform their
operations, the first judging unit 301, the first determining unit 302, the correction
determining unit 303, the second judging unit 304, the second determining unit 305 and the weight
calculation determining unit 306 all stop working. The first judging unit 301, the second judging
unit 304 and the third judging unit 307 may be one and the same judging unit; the first
determining unit 302, the second determining unit 305 and the third determining unit 308 may be
one and the same determining unit.
In the embodiment of the present invention, at least one candidate user remark information is mined
from multiple user remark information, and each group of identical candidate user remark
information in the at least one candidate user remark information is taken as a candidate real
name. According to the pinyin corresponding to each candidate user remark information, the
occurrence count of each identical pinyin is counted, and the posterior probability of each
candidate real name is calculated from the occurrence count of each identical pinyin and the
occurrence count of each candidate real name. Finally, the candidate real name with the maximum
posterior probability is taken as the optimal real name of the user to be mined. In this way the
user's real name can be accurately inferred from user remark information even when the user has
not provided a real name, and the various functions of the social network can be enriched based
on the inferred real name.
Referring to Fig. 10, which is a schematic structural diagram of another data mining processing
device 1 provided by an embodiment of the present invention, the data mining processing device 1
may be applied in a server of a social network. The data mining processing device 1 may include
the acquiring and mining module 10, the calculation module 20 and the determining module 30 of
the embodiment corresponding to Fig. 5. Further, the data mining processing device 1 may also
include: an acquiring and determining module 40, a set determining module 50, and a model
training module 60;
The acquiring and determining module 40 is configured to obtain multiple training user remark
information, corresponding to a user whose real name is known, for training the rank model, and
to take each group of identical training user remark information in the multiple training user
remark information as a training candidate real name;
The set determining module 50 is configured to take each training user remark information
corresponding to the training candidate real name that is the user's real name as a first support
set; the first support set corresponds to a first score value;
The set determining module 50 is further configured to take each training user remark information
corresponding to a training candidate real name that is not the user's real name but has the same
full pinyin as the user's real name as a second support set; the second support set corresponds
to a second score value, and the first score value is greater than the second score value;
The model training module 60 is configured to extract the features of the first support set and
the features of the second support set, and to establish and train the rank model according to
the features of the first support set and the first score value and the features of the second
support set and the second score value;
Wherein, after the rank model has been established and trained by the acquiring and determining
module 40, the set determining module 50 and the model training module 60, the model scoring
determining unit 309 of the embodiment corresponding to Fig. 9 can calculate the final score of
each of the two candidate real names input into the rank model according to the score value of
the support set (the first support set or the second support set) to which the candidate user
remark information corresponding to that candidate real name belongs.
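The construction of the two support sets can be sketched as follows. Feature extraction and the rank model itself are not specified in this section and are therefore left out; the function name `build_support_sets`, the score values 1.0 and 0.5, and the sample data are assumptions chosen only to satisfy the stated requirement that the first score value exceeds the second.

```python
def build_support_sets(real_name, real_pinyin, training_remarks,
                       first_score=1.0, second_score=0.5):
    """Split training remarks of a user with a known real name into support sets.

    training_remarks: iterable of (remark text, full pinyin) pairs.
    Remarks equal to the known real name go into the first support set;
    remarks that differ in writing but share its full pinyin go into the second.
    Remarks with a different pinyin belong to neither set.
    """
    assert first_score > second_score
    first, second = [], []
    for text, pinyin in training_remarks:
        if text == real_name:
            first.append(text)
        elif pinyin == real_pinyin:
            second.append(text)
    return {"first": (first, first_score), "second": (second, second_score)}


# Usage with placeholder written forms "form A" / "form B".
sets = build_support_sets(
    "Wu Xiaobo (form A)",
    "wu xiaobo",
    [
        ("Wu Xiaobo (form A)", "wu xiaobo"),
        ("Wu Xiaobo (form B)", "wu xiaobo"),
        ("Wu Xiaobo (form A)", "wu xiaobo"),
        ("Zhang Haibo", "zhang haibo"),
    ],
)
```

Each set, paired with its score value, would then supply labelled examples for whatever features and ranking learner are chosen.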
Referring to Fig. 11, which is a schematic structural diagram of a server provided by an
embodiment of the present invention, as shown in Fig. 11, the server 1000 may include: at least
one processor 1001 (for example a CPU), at least one network interface 1004, a user interface
1003, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used
to implement connection and communication between these components. The user interface 1003 may
include a display and a keyboard, and optionally may also include a standard wired interface and
a wireless interface. The network interface 1004 may optionally include a standard wired
interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a
high-speed RAM or a non-volatile memory, for example at least one disk memory. The memory 1005
may optionally also be at least one storage device located remotely from the processor 1001. As
shown in Fig. 11, the memory 1005, as a computer storage medium, may include an operating system,
a network communication module, a user interface module, and a device control application.
In the server 1000 shown in Fig. 11, the network interface 1004 is mainly used to connect to a
client and receive the user remark information sent by the client; the user interface 1003 is
mainly used to provide an input interface for the user and obtain data output by the user; and
the processor 1001 may be used to call the device control application stored in the memory 1005
to implement the following:
Obtaining multiple user remark information corresponding to the user to be mined, mining at
least one candidate user remark information from the multiple user remark information, and
taking each group of identical candidate user remark information in the at least one candidate
user remark information as a candidate real name;
Counting, according to the pinyin corresponding to each candidate user remark information, the
occurrence count corresponding to each identical pinyin, and calculating the posterior
probability corresponding to each candidate real name according to the occurrence count of each
identical pinyin and the occurrence count of each candidate real name;
Taking the candidate real name with the maximum posterior probability as the optimal real name
of the user to be mined.
In one embodiment, when obtaining the multiple user remark information corresponding to the user to
be mined, mining at least one candidate user remark information from the multiple user remark
information, and taking each group of identical candidate user remark information as a candidate
real name, the processor 1001 specifically performs:
Obtaining the multiple user remark information corresponding to the user to be mined, and
screening out, from the multiple user remark information, first-type user remark information
that satisfies a surname condition according to a name structure rule and a preset surname
matching table;
Deleting, from the first-type user remark information, the user remark information that contains
proper nouns and/or high-frequency words, determining the first-type user remark information
remaining after deletion to be the at least one candidate user remark information, and taking
each group of identical candidate user remark information in the at least one candidate user
remark information as a candidate real name.
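The two screening steps can be sketched as below. The name structure rule is not spelled out in this section, so the sketch assumes a rule of "2 to 4 characters, starting with a single-character surname from the table" and ignores compound surnames; the function `mine_candidates`, the surname table, the blocked-word list and the sample remarks are all made up for illustration.

```python
from collections import Counter

SURNAME_TABLE = {"王", "李", "张"}        # hypothetical preset surname matching table
BLOCKED_WORDS = {"老师", "经理", "公司"}  # hypothetical proper nouns / high-frequency words


def mine_candidates(remarks, surnames=SURNAME_TABLE, blocked=BLOCKED_WORDS):
    """Return a Counter mapping each candidate real name to its occurrence count."""
    # Step 1: keep remarks that satisfy the (assumed) name structure rule
    # and whose first character appears in the surname matching table.
    first_type = [r for r in remarks if 2 <= len(r) <= 4 and r[0] in surnames]
    # Step 2: delete remarks that contain a proper noun or high-frequency word.
    candidates = [r for r in first_type if not any(w in r for w in blocked)]
    # Identical remaining remarks are grouped into one candidate real name.
    return Counter(candidates)


counts = mine_candidates(["王伟", "王伟", "李娜老师", "快递小哥", "张经理", "李娜"])
```

Here "快递小哥" fails the surname check, while "李娜老师" and "张经理" pass it but are removed in the second step because they contain blocked words.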
In one embodiment, when counting the occurrence count corresponding to each identical pinyin
according to the pinyin corresponding to each candidate user remark information, and calculating
the posterior probability corresponding to each candidate real name according to the occurrence
count of each identical pinyin and the occurrence count of each candidate real name, the
processor 1001 specifically performs:
Obtaining the full pinyin corresponding to each candidate user remark information, the full
pinyin including a surname pinyin and a given-name pinyin;
Counting, according to each candidate user remark information, the occurrence count
corresponding to each identical surname pinyin and the occurrence count corresponding to each
identical given-name pinyin;
Calculating the joint probability corresponding to each identical full pinyin according to the
occurrence count of each identical surname pinyin, the occurrence count of each identical
given-name pinyin, and the total number of candidate user remark information;
Calculating the posterior probability corresponding to each candidate real name according to the
occurrence count of the identical full pinyin with the maximum joint probability and the
occurrence count of each candidate real name.
In one embodiment, when calculating the joint probability corresponding to each identical full
pinyin according to the occurrence count of each identical surname pinyin, the occurrence count
of each identical given-name pinyin, and the total number of candidate user remark information,
the processor 1001 specifically performs:
Calculating the first probability corresponding to each identical surname pinyin according to
the occurrence count of each identical surname pinyin and the total number of candidate user
remark information;
Calculating the second probability corresponding to each identical given-name pinyin according
to the occurrence count of each identical given-name pinyin and the total number of candidate
user remark information;
Combining each first probability with each second probability to obtain the joint probability
corresponding to each identical full pinyin.
In one embodiment, when taking the candidate real name with the maximum posterior probability as
the optimal real name of the user to be mined, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If yes, taking the candidate real name with the maximum posterior probability as the optimal
real name of the user to be mined;
If no, correcting the posterior probability corresponding to each candidate real name according
to a preset weight adjustment rule, and taking the candidate real name with the maximum corrected
posterior probability as the optimal real name of the user to be mined;
Wherein, the weight adjustment rule includes at least one of the following mapping relations: a
mapping between the occurrence count of a candidate real name and a correction parameter, a
mapping between the weight of an identical full pinyin and a correction parameter, a mapping
between the character complexity of a candidate real name and a correction parameter, a mapping
between the character length of a candidate real name and a correction parameter, and a mapping
between the popularity of a surname and a correction parameter.
In one embodiment, when taking the candidate real name with the maximum posterior probability as
the optimal real name of the user to be mined, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If yes, taking the candidate real name with the maximum posterior probability as the optimal
real name of the user to be mined;
If no, calculating the ranking weight value corresponding to each candidate real name according
to the user real-name remark habit value corresponding to each candidate user remark information
and the posterior probability corresponding to each candidate real name, and taking the candidate
real name with the maximum ranking weight value as the optimal real name of the user to be mined;
Wherein, the user real-name remark habit value is the ratio of the number of user remark
information in which a user remarks friends by their real names to the number of all user remark
information with which that user remarks friends.
In one embodiment, when taking the candidate real name with the maximum posterior probability as
the optimal real name of the user to be mined, the processor 1001 specifically performs:
Judging whether the maximum posterior probability is greater than a preset probability threshold;
If yes, taking the candidate real name with the maximum posterior probability as the optimal
real name of the user to be mined;
If no, selecting the candidate user remark information corresponding to the candidate real names
with the largest and second-largest posterior probabilities, performing feature extraction on the
selected candidate user remark information, scoring the candidate real names with the largest and
second-largest posterior probabilities according to the extracted features and a preset rank
model, and taking the candidate real name with the higher score as the optimal real name of the
user to be mined.
In one embodiment, the processor 1001 further performs:
Obtaining multiple training user remark information, corresponding to a user whose real name is
known, for training the rank model, and taking each group of identical training user remark
information in the multiple training user remark information as a training candidate real name;
Taking each training user remark information corresponding to the training candidate real name
that is the user's real name as a first support set; the first support set corresponds to a
first score value;
Taking each training user remark information corresponding to a training candidate real name
that is not the user's real name but has the same full pinyin as the user's real name as a
second support set; the second support set corresponds to a second score value, and the first
score value is greater than the second score value;
Extracting the features of the first support set and the features of the second support set, and
establishing and training the rank model according to the features of the first support set and
the first score value and the features of the second support set and the second score value;
Wherein, the first support set and the second support set in the trained rank model are used to
score the input candidate real names.
In the embodiment of the present invention, at least one candidate user remark information is mined
from multiple user remark information, and each group of identical candidate user remark
information in the at least one candidate user remark information is taken as a candidate real
name. According to the pinyin corresponding to each candidate user remark information, the
occurrence count of each identical pinyin is counted, and the posterior probability of each
candidate real name is calculated from the occurrence count of each identical pinyin and the
occurrence count of each candidate real name. Finally, the candidate real name with the maximum
posterior probability is taken as the optimal real name of the user to be mined. In this way the
user's real name can be accurately inferred from user remark information even when the user has
not provided a real name, and the various functions of the social network can be enriched based
on the inferred real name.
A person of ordinary skill in the art will understand that all or part of the processes in the
methods of the above embodiments can be implemented by a computer program instructing the
relevant hardware. The program can be stored in a computer-readable storage medium and, when
executed, may include the processes of the embodiments of the above methods. The storage medium
may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random
access memory (Random Access Memory, RAM), or the like.
What is disclosed above is merely a preferred embodiment of the present invention and certainly
cannot be used to limit the scope of the rights of the present invention; equivalent variations
made according to the claims of the present invention therefore still fall within the scope
covered by the present invention.
Claims (16)
1. A data mining processing method, characterized by comprising:
obtaining multiple user remark information corresponding to a user to be mined, mining at least
one candidate user remark information from the multiple user remark information, and taking each
group of identical candidate user remark information in the at least one candidate user remark
information as a candidate real name;
counting, according to the pinyin corresponding to each candidate user remark information, the
occurrence count corresponding to each identical pinyin, and calculating the posterior
probability corresponding to each candidate real name according to the occurrence count of each
identical pinyin and the occurrence count of each candidate real name;
taking the candidate real name with the maximum posterior probability as the optimal real name
of the user to be mined.
2. The method of claim 1, characterized in that the obtaining multiple user remark information
corresponding to a user to be mined, mining at least one candidate user remark information from
the multiple user remark information, and taking each group of identical candidate user remark
information in the at least one candidate user remark information as a candidate real name
comprises:
obtaining the multiple user remark information corresponding to the user to be mined, and
screening out, from the multiple user remark information, first-type user remark information
that satisfies a surname condition according to a name structure rule and a preset surname
matching table;
deleting, from the first-type user remark information, the user remark information that contains
proper nouns and/or high-frequency words, determining the first-type user remark information
remaining after deletion to be the at least one candidate user remark information, and taking
each group of identical candidate user remark information in the at least one candidate user
remark information as a candidate real name.
3. The method of claim 1, characterized in that the counting, according to the pinyin
corresponding to each candidate user remark information, the occurrence count corresponding to
each identical pinyin, and calculating the posterior probability corresponding to each candidate
real name according to the occurrence count of each identical pinyin and the occurrence count of
each candidate real name comprises:
obtaining the full pinyin corresponding to each candidate user remark information, the full
pinyin including a surname pinyin and a given-name pinyin;
counting, according to each candidate user remark information, the occurrence count
corresponding to each identical surname pinyin and the occurrence count corresponding to each
identical given-name pinyin;
calculating the joint probability corresponding to each identical full pinyin according to the
occurrence count of each identical surname pinyin, the occurrence count of each identical
given-name pinyin, and the total number of candidate user remark information;
calculating the posterior probability corresponding to each candidate real name according to the
occurrence count of the identical full pinyin with the maximum joint probability and the
occurrence count of each candidate real name.
4. The method of claim 3, wherein calculating the joint probability of each identical full pinyin from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin, and the total number of pieces of candidate user remark information comprises:
calculating a first probability for each identical surname pinyin from the occurrence count of that surname pinyin and the total number of pieces of candidate user remark information;
calculating a second probability for each identical given-name pinyin from the occurrence count of that given-name pinyin and the total number of pieces of candidate user remark information;
combining each first probability with each second probability to obtain the joint probability of each identical full pinyin.
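The probability steps of claims 3 and 4 can be sketched as below. The claims do not give explicit formulas, so two points here are assumptions: the first and second probabilities are combined by multiplication, and the posterior of a candidate real name is taken as its occurrence count divided by the occurrence count of the top full pinyin. The input format (name, surname pinyin, given-name pinyin) is also hypothetical; pinyin conversion itself is outside the sketch.

```python
from collections import Counter


def best_real_name(candidates):
    """candidates: list of (name, surname_pinyin, given_pinyin), one per candidate remark."""
    total = len(candidates)
    surname_freq = Counter(s for _, s, _ in candidates)
    given_freq = Counter(g for _, _, g in candidates)
    full_freq = Counter((s, g) for _, s, g in candidates)
    name_freq = Counter(n for n, _, _ in candidates)

    # First probability (surname) times second probability (given name).
    # Multiplying them is an assumption; the claim only says they are "calculated".
    joint = {
        (s, g): (surname_freq[s] / total) * (given_freq[g] / total)
        for (s, g) in full_freq
    }
    top_pinyin = max(joint, key=joint.get)

    # Assumed posterior: share of the top full pinyin's occurrences held by each name.
    posterior = {
        n: name_freq[n] / full_freq[top_pinyin]
        for n, s, g in set(candidates)
        if (s, g) == top_pinyin
    }
    best = max(posterior, key=posterior.get)
    return best, posterior
```

With three remarks "王小明", one "王晓明" (both wang / xiaoming) and one "李雷", the top full pinyin is wang xiaoming, and the posteriors under these assumptions are 0.75 and 0.25.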
5. the method for claim 1, it is characterised in that the described candidate by maximum posterior probability
Real name as the optimum real name of described user to be excavated, including:
Judge that whether the posterior probability of maximum is more than predetermined probabilities threshold value;
If being judged as YES, then using candidate's real name of the posterior probability of described maximum as described user's to be excavated
Optimum real name;
If being judged as NO, then adjust power rule general to the posteriority that described each candidate's real name is the most corresponding according to preset
Rate is modified, and using candidate's real name of maximum revised posterior probability as described user's to be excavated
Optimum real name;
Wherein, described power rule is adjusted to include: the frequency of occurrence of candidate's real name and the mapping relations of corrected parameter,
The weight of identical full pinyin and the mapping relations of corrected parameter, the character complexity of candidate's real name and corrected parameter
Mapping relations, character length and the mapping relations of corrected parameter of candidate's real name, surname popularity with
At least one mapping relations in the mapping relations of corrected parameter.
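The threshold check and correction of claim 5 can be sketched as follows. The multiplicative form of the correction and the specific length-to-parameter mapping are assumptions made for illustration; the claim only states that a mapping between a property of the candidate and a correction parameter exists.

```python
def length_correction(name):
    # Hypothetical mapping from character length to a correction parameter.
    return {2: 0.9, 3: 1.2}.get(len(name), 1.0)


def pick_with_correction(posterior, threshold, correction=length_correction):
    """posterior: {candidate name: posterior probability}."""
    best = max(posterior, key=posterior.get)
    if posterior[best] > threshold:
        # Confident enough: accept the maximum-posterior candidate directly.
        return best
    # Otherwise correct every posterior (assumed: multiply by the correction
    # parameter) and pick the maximum corrected value.
    corrected = {name: p * correction(name) for name, p in posterior.items()}
    return max(corrected, key=corrected.get)
```

With posteriors of 0.5 for "王明" and 0.45 for "王小明", a threshold of 0.4 returns "王明" directly, while a threshold of 0.6 triggers the correction and, under this hypothetical mapping, returns "王小明".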
6. the method for claim 1, it is characterised in that the described candidate by maximum posterior probability
Real name as the optimum real name of described user to be excavated, including:
Judge that whether the posterior probability of maximum is more than predetermined probabilities threshold value;
If being judged as YES, then using candidate's real name of the posterior probability of described maximum as described user's to be excavated
Optimum real name;
If being judged as NO, then practise according to user's remarks real name that described each candidate user remark information is the most corresponding
The posterior probability of used value and described each candidate's real name correspondence respectively, calculates described each candidate's real name correspondence respectively
Weight order value, and using candidate's real name of maximum weight order value as the optimum of described user to be excavated
Real name;
Wherein, described user's remarks real name custom value refers to that user carries out the user in remarks for real name to good friend
Remark information quantity and this user carry out the ratio of the quantity of all user's remark informations of remarks to good friend.
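The habit-value weighting of claim 6 can be sketched as below. The habit value follows the ratio defined in the claim; how it is combined with the posterior is not specified, so averaging the habit values of the remark authors and multiplying by the posterior is an assumption, as are all parameter names.

```python
def habit_value(real_name_remarks, all_remarks):
    """Share of a user's friend remarks that are real names (ratio from claim 6)."""
    return real_name_remarks / all_remarks if all_remarks else 0.0


def rank_by_habit(posterior, authors_by_name, habit):
    """posterior: {candidate name: probability};
    authors_by_name: {candidate name: [ids of users who wrote that remark]};
    habit: {user id: habit value}."""
    weights = {}
    for name, p in posterior.items():
        authors = authors_by_name[name]
        mean_habit = sum(habit[a] for a in authors) / len(authors)
        # Assumed combination: posterior scaled by the authors' mean habit value.
        weights[name] = p * mean_habit
    return max(weights, key=weights.get), weights
```

The intent is that a candidate written by users who usually label friends with real names outranks an equally probable candidate written by users who rarely do.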
7. the method for claim 1, it is characterised in that the described candidate by maximum posterior probability
Real name as the optimum real name of described user to be excavated, including:
Judge that whether the posterior probability of maximum is more than predetermined probabilities threshold value;
If being judged as YES, then using candidate's real name of the posterior probability of described maximum as described user's to be excavated
Optimum real name;
If being judged as NO, then select the candidate that candidate's real name of maximum with second largest posterior probability is the most corresponding
User's remark information, and to selected go out candidate user remark information carry out feature extraction, and according to extraction
Described feature and preset sequence rank model candidate's real name of the maximum and second largest posterior probability is carried out
Scoring, and using the high candidate's real name optimum real name as described user to be excavated of marking.
8. The method of claim 7, further comprising:
obtaining multiple pieces of training user remark information, corresponding to a user whose real name is known, for training the rank model, and taking identical pieces of training user remark information among the multiple pieces of training user remark information as respective training candidate real names;
taking each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, the first support set corresponding to a first score value;
taking each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set, the second support set corresponding to a second score value, the first score value being greater than the second score value;
extracting features of the first support set and features of the second support set, and building and training the rank model from the features of the first support set and the first score value and from the features of the second support set and the second score value;
wherein the first support set and the second support set in the trained rank model are used to score input candidate real names.
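Construction of the two support sets in claim 8 can be sketched as follows. The score values 1.0 and 0.0 and the input format are assumptions; the claim only requires the first score value to exceed the second. Feature extraction and the rank model itself are left out, since the claims do not specify them.

```python
def build_support_sets(real_name, real_pinyin, training_remarks,
                       first_score=1.0, second_score=0.0):
    """training_remarks: list of (remark, full_pinyin) for one user whose real name is known.

    Returns [(first_support_set, first_score), (second_support_set, second_score)].
    """
    # First support set: remarks that are exactly the known real name.
    first = [r for r, _ in training_remarks if r == real_name]
    # Second support set: remarks that differ from the real name but share its full pinyin.
    second = [r for r, p in training_remarks if r != real_name and p == real_pinyin]
    return [(first, first_score), (second, second_score)]
```

Remarks whose full pinyin differs from that of the real name fall into neither set, so the model is trained to separate homophones of the real name from the real name itself.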
9. A data mining processing apparatus, comprising:
an acquisition and mining module, configured to obtain multiple pieces of user remark information corresponding to a user to be mined, mine at least one piece of candidate user remark information from the multiple pieces of user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names;
a calculation module, configured to count, according to the pinyin corresponding to each piece of candidate user remark information, the occurrence count of each identical pinyin, and to calculate the posterior probability of each candidate real name from the occurrence count of each identical pinyin and the occurrence count of each candidate real name;
a determination module, configured to take the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined.
10. The apparatus of claim 9, wherein the acquisition and mining module comprises:
an acquisition and screening unit, configured to obtain the multiple pieces of user remark information corresponding to the user to be mined, and to select, from the multiple pieces of user remark information and according to a name structure rule and a preset surname matching table, first-class user remark information that satisfies a surname condition;
a deletion and determination unit, configured to delete, from the first-class user remark information, any user remark information that contains proper nouns and/or high-frequency words, determine the first-class user remark information remaining after the deletion as the at least one piece of candidate user remark information, and take identical pieces of candidate user remark information among the at least one piece of candidate user remark information as respective candidate real names.
11. The apparatus of claim 9, wherein the calculation module comprises:
a pinyin acquisition unit, configured to obtain the full pinyin corresponding to each piece of candidate user remark information, the full pinyin comprising a surname pinyin and a given-name pinyin;
a frequency statistics unit, configured to count, according to each piece of candidate user remark information, the occurrence count of each identical surname pinyin and the occurrence count of each identical given-name pinyin;
a first probability calculation unit, configured to calculate the joint probability of each identical full pinyin from the occurrence count of each identical surname pinyin, the occurrence count of each identical given-name pinyin, and the total number of pieces of candidate user remark information;
a second probability calculation unit, configured to calculate the posterior probability of each candidate real name from the occurrence count of the identical full pinyin having the maximum joint probability and the occurrence count of each candidate real name.
12. The apparatus of claim 11, wherein the first probability calculation unit comprises:
a first probability calculation subunit, configured to calculate a first probability for each identical surname pinyin from the occurrence count of that surname pinyin and the total number of pieces of candidate user remark information;
a second probability calculation subunit, configured to calculate a second probability for each identical given-name pinyin from the occurrence count of that given-name pinyin and the total number of pieces of candidate user remark information;
a joint probability calculation subunit, configured to combine each first probability with each second probability to obtain the joint probability of each identical full pinyin.
13. The apparatus of claim 9, wherein the determination module comprises:
a first judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;
a first determination unit, configured to take, if the first judging unit judges yes, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a correction and determination unit, configured to correct, if the first judging unit judges no, the posterior probability of each candidate real name according to a preset weight adjustment rule, and to take the candidate real name with the maximum corrected posterior probability as the optimal real name of the user to be mined;
wherein the weight adjustment rule comprises at least one of the following mapping relations: a mapping relation between the occurrence count of a candidate real name and a correction parameter, a mapping relation between the weight of an identical full pinyin and a correction parameter, a mapping relation between the character complexity of a candidate real name and a correction parameter, a mapping relation between the character length of a candidate real name and a correction parameter, and a mapping relation between surname popularity and a correction parameter.
14. The apparatus of claim 9, wherein the determination module comprises:
a second judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;
a second determination unit, configured to take, if the second judging unit judges yes, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a weight calculation and determination unit, configured to calculate, if the second judging unit judges no, a ranking weight value for each candidate real name from the user real-name remark habit value corresponding to each piece of candidate user remark information and the posterior probability of each candidate real name, and to take the candidate real name with the maximum ranking weight value as the optimal real name of the user to be mined;
wherein the user real-name remark habit value is the ratio of the number of remarks in which a user labels friends with their real names to the number of all remarks that the user has made for friends.
15. The apparatus of claim 9, wherein the determination module comprises:
a third judging unit, configured to judge whether the maximum posterior probability is greater than a preset probability threshold;
a third determination unit, configured to take, if the third judging unit judges yes, the candidate real name with the maximum posterior probability as the optimal real name of the user to be mined;
a model scoring and determination unit, configured to select, if the third judging unit judges no, the candidate user remark information corresponding to the candidate real names with the largest and second-largest posterior probabilities, perform feature extraction on the selected candidate user remark information, score the candidate real names with the largest and second-largest posterior probabilities according to the extracted features and a preset rank model, and take the candidate real name with the higher score as the optimal real name of the user to be mined.
16. The apparatus of claim 15, further comprising:
an acquisition and determination module, configured to obtain multiple pieces of training user remark information, corresponding to a user whose real name is known, for training the rank model, and to take identical pieces of training user remark information among the multiple pieces of training user remark information as respective training candidate real names;
a set determination module, configured to take each piece of training user remark information corresponding to the training candidate real name that is the user's real name as a first support set, the first support set corresponding to a first score value;
the set determination module being further configured to take each piece of training user remark information corresponding to a training candidate real name that is not the user's real name but has the full pinyin of the user's real name as a second support set, the second support set corresponding to a second score value, the first score value being greater than the second score value;
a model training module, configured to extract features of the first support set and features of the second support set, and to build and train the rank model from the features of the first support set and the first score value and from the features of the second support set and the second score value;
wherein the first support set and the second support set in the trained rank model are used to score input candidate real names.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610387322.5A CN106021235B (en) | 2016-06-01 | 2016-06-01 | A kind of data mining processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610387322.5A CN106021235B (en) | 2016-06-01 | 2016-06-01 | A kind of data mining processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106021235A (en) | 2016-10-12 |
CN106021235B (en) | 2019-01-29 |
Family
ID=57089437
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610387322.5A (Active, CN106021235B (en)) | A kind of data mining processing method and device | 2016-06-01 | 2016-06-01 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106021235B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107329672A (en) * | 2017-07-18 | 2017-11-07 | 携程旅游网络技术(上海)有限公司 | Pass through the method for mouse track final election hyperlink, system, equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102004788A (en) * | 2010-12-07 | 2011-04-06 | 北京开心人信息技术有限公司 | Method and system for intelligently positioning linkman of social networking services |
CN104573076A (en) * | 2015-01-27 | 2015-04-29 | 南京烽火星空通信发展有限公司 | Social networking site user Chinese remark name system recommendation method |
JP2015132902A (en) * | 2014-01-09 | 2015-07-23 | サクサ株式会社 | Electronic conference system and program of the same |
Non-Patent Citations (2)
Title |
---|
BYUNG-WON ON: "Social Network Analysis on Name Disambiguation and More", 《THIRD 2008 INTERNATIONAL CONFERENCE ON CONVERGENCE AND HYBRID INFORMATION TECHNOLOGY》 * |
任蔷 等: "实名SNS社交网络与微博的特征分析", 《现代情报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107329672A (en) * | 2017-07-18 | 2017-11-07 | 携程旅游网络技术(上海)有限公司 | Pass through the method for mouse track final election hyperlink, system, equipment and storage medium |
CN107329672B (en) * | 2017-07-18 | 2020-01-14 | 携程旅游网络技术(上海)有限公司 | Method, system, device and storage medium for checking hyperlink through mouse track |
Also Published As
Publication number | Publication date |
---|---|
CN106021235B (en) | 2019-01-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN102930055B (en) | The network new word discovery method of the connecting inner degree of polymerization and external discrete information entropy | |
CN103076892B (en) | A kind of method and apparatus of the input candidate item for providing corresponding to input character string | |
US20200081977A1 (en) | Keyword extraction method and apparatus, storage medium, and electronic apparatus | |
CN103853738B (en) | A kind of recognition methods of info web correlation region | |
CN103914494B (en) | Method and system for identifying identity of microblog user | |
CN107169001A (en) | A kind of textual classification model optimization method based on mass-rent feedback and Active Learning | |
CN106570144A (en) | Method and apparatus for recommending information | |
CN108280130A (en) | A method of finding sensitive data in text big data | |
CN109684446B (en) | Text semantic similarity calculation method and device | |
EP2657852A1 (en) | Method and device for filtering harmful information | |
KR20110115542A (en) | Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction | |
CN109582704A (en) | Recruitment information and the matched method of job seeker resume | |
US20110202545A1 (en) | Information extraction device and information extraction system | |
CN101425071A (en) | Location expression detection device and computer readable medium | |
EP3029567B1 (en) | Method and device for updating input method system, computer storage medium, and device | |
US20170351739A1 (en) | Method and apparatus for identifying timeliness-oriented demands, an apparatus and non-volatile computer storage medium | |
CN112541095B (en) | Video title generation method and device, electronic equipment and storage medium | |
CN103377245A (en) | Automatic question and answer method and device | |
CN105893484A (en) | Microblog Spammer recognition method based on text characteristics and behavior characteristics | |
CN106650446A (en) | Identification method and system of malicious program behavior, based on system call | |
CN106354818A (en) | Dynamic user attribute extraction method based on social media | |
CN103955450A (en) | Automatic extraction method of new words | |
WO2024011933A1 (en) | Combined sensitive-word detection method and apparatus, and cluster | |
CN114969326A (en) | Classification model training and semantic classification method, device, equipment and medium | |
WO2018150472A1 (en) | Exchange-type attack simulation device, exchange-type attack simulation method, and exchange-type attack simulation program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20240104 Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd. Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |