CN102495860B - Expert recommendation method based on language model - Google Patents

Expert recommendation method based on a language model

Info

Publication number
CN102495860B
CN102495860B (application CN201110373475A)
Authority
CN
China
Prior art keywords
user
model
expert
probability
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110373475
Other languages
Chinese (zh)
Other versions
CN102495860A (en)
Inventor
崔斌
姚俊杰
阴红志
刘晴芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN 201110373475 priority Critical patent/CN102495860B/en
Publication of CN102495860A publication Critical patent/CN102495860A/en
Application granted granted Critical
Publication of CN102495860B publication Critical patent/CN102495860B/en


Abstract

The invention discloses an expert recommendation method based on a language model. The method comprises the following steps: S1, collecting the content published by users to characterize each user's knowledge features, building a probability-based text model of those features using a language model from text retrieval, and maintaining an expertise-score data index for each user in the built model; S2, building a user relationship model among users, containing a user relationship graph through which users' expertise scores influence one another; S3, when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users; and S4, adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.

Description

Expert recommendation method based on a language model
Technical field
The present invention relates to the field of Internet technology, and in particular to an expert recommendation method based on a language model.
Background technology
Expert recommendation refers to the process of modeling the authority and specialty fields of a pool of candidate users and, given a query request, recommending the expert users that match it. In today's massive accumulation of information, a great deal of useful but as-yet-unmined information exists, and the urgent problem of information "dead corners" — information that never circulates because it never meets the right opportunity — remains. The purpose of expert recommendation is to turn the one-way direction of search into two-way exchange: by providing an effective information push mechanism, it accelerates the circulation of information on the network. As an important branch of user management, earlier expert recommendation methods were either constrained by the application scenarios and models of traditional fields, or adopted only simple user modeling and ranking methods. They lack flexibility and generality and cannot adapt to expert identification in the new Internet environment.
The rise of interactive online communities has made people increasingly accustomed to providing and obtaining information there. Hidden among the large numbers of questions and answers are experts who are comparatively proficient in specific topics; the knowledge they possess lets them answer users' questions correctly, so the answers they provide are more valuable. Compared with passively waiting after a question is posted, actively seeking out experts and pushing the question to users who can answer it improves both the accuracy and the speed of answers, greatly increases the timeliness of information exchange in Q&A communities, and better matches the growth momentum of Q&A-style communities.
However, experts in a community are largely unlabeled. In travel forums such as Tripadvisor, administrators judge large numbers of posts to identify experts; although the experts found by such manual recognition have a very high success rate, the process is very inefficient. Domestic sites such as Baidu Zhidao and Yahoo Knowledge rank users by points, but points depend more on the quantity of answers and the user's activity level, which cannot serve as the sole basis for judging expertise. In particular, users who post low-quality answers may gain a large advantage through sheer quantity and activity, possibly masking the performance of genuine experts.
Researchers therefore hope to perform semantic recognition on the posts a user publishes and analyze their content in order to determine the quality of the user's answers, and thereby judge more accurately whether the user is an expert in a given field.
There are many methods for judging whether a query and a document are relevant, for example the Boolean model and the vector space model. The degree of relevance between a query demand and a document can also be given as a probability. Intuitively, if the words of a query occur in a document, a user may consider the document relevant to the query; what a probabilistic model computes is the probability that the query demand and the document are relevant, i.e. P(R=1 | q, d), where q is the query demand and d is the document.
A traditional probabilistic model models the relevance of the whole query to the document. A language model instead models the document itself and computes the probability that the document's language model generates the query, i.e. P(q | M_d), where q is the query and M_d is the language model built from the document. Under a language model, the document is viewed as a "dictionary" that may generate query demands: each time a query demand is issued, its relevance is determined by computing the generation probability.
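As a toy illustration of this query-likelihood idea (the words and counts below are invented for demonstration and are not from the patent), a maximum-likelihood unigram model of a document can score a query as follows:

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q | M_d): product over query terms of each term's MLE probability in the document."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / n  # zero whenever t never occurs in the document
    return p

# A document containing "java java cache" generates the query "java cache"
# with probability (2/3) * (1/3) = 2/9.
print(query_likelihood(["java", "cache"], ["java", "java", "cache"]))
```

The zero result for unseen terms is exactly the weakness the smoothing formula later in the patent addresses.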
Summary of the invention
(1) Technical problem to be solved
The objective of the present invention is to propose an expert recommendation method based on a language model that can actively seek out experts, push a question to the users able to answer it, improve the accuracy and speed of answers, and increase the timeliness of information exchange in Q&A communities.
(2) Technical solution
To solve the above technical problem, the invention provides an expert recommendation method based on a language model, comprising the steps of:
S1: collecting the content published by users to characterize each user's knowledge features, and building a probability-based text model of those features using a language model from text retrieval, the built model containing an expertise-score data index for each user;
S2: building a user relationship model between users, the model containing a user relationship graph through which users' expertise scores influence one another;
S3: when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users;
S4: adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.
Preferably, in step S1 each post published by a user is modeled as a sample drawn from a dictionary; the model behind the sample is inferred, and the probability that this model generates the query is then computed.
Preferably, the language model is built on two levels: first, an independent language model M_td is built for each of the user's posts; second, all posts of all users are treated as a single text from which a collection language model M_c is built. The probability that the whole query occurs is

p(q | M_td) = ∏_{t ∈ q} ( β · p(t | M_td) + (1 − β) · p(t | M_c) )

where q is the query text and β is a parameter that balances the weight between the whole post collection and the single post.
Preferably, a vector space model is used to compute the similarity between a post and the best answer: high-dimensional vectors are built for the post under quality assessment and for the best answer, and the angle between the two vectors is computed. The larger the cosine of the angle, the smaller the angle and the more similar the two are considered. The vector similarity between the post and the best answer is

Sim(td, BestAns) = cos(td, BestAns) = (td · BestAns) / (|td| · |BestAns|).
Preferably, the collection of all of a user's answers is treated as one large text corpus from which a language model is built; the probability that this language model generates the current answer is computed, and the quality of the answer is judged by whether its content derives from the user's knowledge-feature structure.
Preferably, a length-correction formula is applied in which each answer is weighted by the share it occupies of the total length of all answers to the same question. [Formula rendered as an image in the original filing.]
Preferably, in step S1 the user's knowledge features are stored as a text inverted index that supports continuous updating and subsequent cleaning operations.
Preferably, the parameters of the model built in step S1 may be estimated by point estimation, the method of moments, or maximum likelihood estimation.
Preferably, step S2 builds the user relationship model on the basis of Google's PageRank algorithm.
Preferably, after the users' expertise ranking has been computed from the built language model, it is corrected using the users' PageRank values; before a user's PageRank value is used, it is weighted by multiplying it by the number of questions shared with that user.
(3) Beneficial effects
By building a language model and a user relationship model, the present invention extracts users' expertise scores from the data and then corrects the expertise ranking using the relationship graph between users extracted from the data. Compared with passively waiting after a question is posted, it actively seeks out experts and pushes the question to the users able to answer it, improving the accuracy and speed of answers, greatly increasing the timeliness of information exchange in Q&A communities, and better matching the growth momentum of Q&A-style communities.
Description of the drawings
Fig. 1 is a flow chart of the structure of the method of the invention;
Fig. 2 is a comparison chart of assessment indicators in an embodiment of the invention; the indicators from left to right correspond to query1-6;
Fig. 3 is another comparison chart of assessment indicators in an embodiment of the invention; the indicators from left to right correspond to query1-6.
Embodiment
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.
The structural flow chart of the expert recommendation method based on a language model of the present invention is shown in Fig. 1, where dashed boxes indicate algorithm models and solid boxes indicate data. The flow runs from top to bottom and from left to right. The upper part is static and can be built in advance; the lower part is dynamic and requires instant computation on user input. The present invention comprises the steps of: S1: collecting the content published by users to characterize each user's knowledge features, and building a probability-based text model of those features using a language model from text retrieval, the built model containing an expertise-score data index for each user; S2: building a user relationship model between users, the model containing a user relationship graph through which users' expertise scores influence one another; S3: when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users; S4: adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.
The language model method and its derivation
A language model is a kind of probabilistic model. The problem to be solved in this patent is to produce, given an input question, the list of users most relevant to it. That is, p(u | q) is obtained from the query text and the user's existing posting history, where u is the user and q is the query text.
According to Bayes' theorem,

p(u | q) = p(q | u) · p(u) / p(q)
When comparing users' probabilities for the same query, the value of p(q) is constant and can be omitted. Assuming here that p(u) is uniformly distributed, it suffices to rank users by p(q | u). The collection of all answers posted by user u serves as the modeling basis for u. Therefore
p(q | u) = p(q | M_pf) = f( p(q | M_td1), p(q | M_td2), ..., p(q | M_tdn) )

where pf is the set of all posts of user u and tdi ∈ pf for i = 1, 2, ..., n.
The model of the present invention treats each post published by a user as a sample drawn from a dictionary, infers the model behind the sample, and then computes the probability that this model generates the query. During modeling, the generation probability of each word in the text must be computed. Parameter estimation may use point estimation, the method of moments, maximum likelihood estimation, and so on; in the experiments, maximum likelihood estimation is used.
Maximum likelihood estimation (MLE) assumes that behind a text there is a probability distribution and that the text is one sample of that distribution.
p(t1, t2, ..., tn; θ) = P(T1 = t1, T2 = t2, ..., Tn = tn)

where θ is the unknown parameter.
To maximize p(t1, t2, ..., tn; θ), its partial derivatives are taken, from which the values of the unknown parameters can be computed. It follows that under MLE the probability of a word t occurring in a text d is

p(t | M_d) = c(t, d) / |d|

where c(t, d) is the number of occurrences of t in d and |d| is the total number of words in d.
The language model is built on two levels: first, an independent language model M_td is built for each of the user's posts; second, all posts of all users are treated as one text from which a collection language model M_c is built. The probability that the whole query occurs is

p(q | M_td) = ∏_{t ∈ q} ( β · p(t | M_td) + (1 − β) · p(t | M_c) )

where β is a parameter that balances the weight between the whole post collection and the single post. The benefit of this smoothing is that when a query word does not appear in a single post, its probability is not 0, which would otherwise force the probability of the whole query under that post to 0 as well.
Basic model
Each of a user's posts yields a probability of generating the query. How these probabilities are merged into the probability that the user's posts as a whole generate the query is the question this patent discusses, namely the computation of the function f in

p(q | M_pf) = f( p(q | M_td1), p(q | M_td2), ..., p(q | M_tdn) )
Equal weights
The most basic language model gives each post equal weight. After the probability that each post generates the query is computed, the probabilities are added and averaged to obtain the probability that the user's whole post collection generates the query.
Algorithm 3-1
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post
        Average P(t | td) over all the user's posts (sum divided by the number of posts) to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
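Algorithm 3-1 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the user/post data, function names, and the default β = 0.7 are assumptions for demonstration; the smoothing follows the β-interpolation formula given earlier.

```python
from collections import Counter

def term_prob(t, post, coll_counts, coll_len, beta=0.7):
    # beta * p(t | M_td) + (1 - beta) * p(t | M_c): interpolate the post's MLE
    # estimate with the whole-collection estimate so unseen terms are not zero.
    p_post = Counter(post)[t] / len(post) if post else 0.0
    p_coll = coll_counts[t] / coll_len
    return beta * p_post + (1 - beta) * p_coll

def rank_users(query, user_posts, beta=0.7):
    """Equal-weight basic model: average P(t|td) over a user's posts per term,
    multiply across query terms, and sort users by the resulting probability."""
    all_terms = [t for posts in user_posts.values() for post in posts for t in post]
    coll_counts, coll_len = Counter(all_terms), len(all_terms)
    scores = {}
    for user, posts in user_posts.items():
        score = 1.0
        for t in query:
            per_post = [term_prob(t, p, coll_counts, coll_len, beta) for p in posts]
            score *= sum(per_post) / len(per_post)
        scores[user] = score
    return sorted(scores, key=scores.get, reverse=True)

posts = {
    "alice": [["java", "java", "thread"], ["python", "script"]],
    "bob": [["cooking", "recipe"]],
}
print(rank_users(["java"], posts))  # alice's posts mention "java", so she ranks first
```

The collection model keeps bob's score above zero for the query "java" even though none of his posts contains the term.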
Benchmark models
A benchmark model is the reference standard for measuring whether a new algorithm is effective. This patent proposes two reference standards. One is the basic language model described above, in which every post has equal weight. The other ignores content entirely and ranks users only by their posting volume. The latter gives the same expert ranking for every query; such a content-independent ranking is clearly inflexible, since many of a user's posts may discuss topics unrelated to the query, so a ranking based only on post count is very inaccurate. On top of the basic language model, this patent makes several improvements in search of more effective algorithms.
Comparison with best-answer similarity
On Answers, if an answer is chosen as the best answer by the asker, the quality of that post is high. This advantage can be incorporated into the quality assessment of posts: the probability computed purely by MLE is revised according to the distance between the user's answer and the best answer, so that posts are no longer treated as equally important and higher-quality posts receive higher importance. The revision takes the form of multiplying by a weight when the probability that the post generates the query demand is computed. Semantically, the higher this weight, the higher the quality of the post and the more likely the user is an expert on questions of this kind. In the user's final score, multiplying by this weight widens the gap between the generation probabilities of different answers to the same question: the greater the similarity to the best answer, the higher the probability that the post can answer the query demand, ultimately raising the user's score.
A vector space model is used to compute the similarity between a post and the best answer: high-dimensional vectors are built for the post under quality assessment and for the best answer, and the angle between the two vectors is computed. The larger the cosine of the angle, the smaller the angle and the more similar the two are considered.

Sim(td, BestAns) = cos(td, BestAns) = (td · BestAns) / (|td| · |BestAns|)
Algorithm 3-2
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × Sim(td, BestAns)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
Note: Sim(td, BestAns) is the vector similarity between the post and the best answer.
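A minimal sketch of the cosine weight used in Algorithm 3-2, built over raw term-count vectors (the patent does not specify the vector construction, e.g. any tf-idf weighting, so plain counts are assumed here):

```python
import math
from collections import Counter

def cosine_sim(post_terms, best_ans_terms):
    """Sim(td, BestAns) = (td . BestAns) / (|td| * |BestAns|) over term-count vectors."""
    a, b = Counter(post_terms), Counter(best_ans_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical term distributions give similarity 1 (up to rounding);
# disjoint vocabularies give 0.
print(cosine_sim(["reboot", "router"], ["reboot", "router"]))
print(cosine_sim(["reboot"], ["recipe"]))
```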
Contribution of the user's whole answer history to an answer
To compute the quality of a post, besides comparing it with the best answer, a frame of reference can also be obtained from the collection of all of the user's answers. If a user is proficient in some area of knowledge, the words associated with that knowledge should appear frequently in his answers. That is, if a word appears often across the user's whole post collection and is used in the post answering the current question, it can be concluded that the user is proficient and authoritative in that area and applied that professional knowledge in his answer, so the post should be judged to be of higher quality.
This contribution degree can be viewed as the user's existing knowledge having generated the current post. The collection of all of the user's answers is therefore treated as one large text corpus, a language model is built from it, and the probability that this language model generates the current answer is computed. The quality of the answer is judged by whether its content derives from the user's knowledge system. The contribution degree is finally multiplied into the probability that the post generates the query, so that high-quality answers raise the user's expertise score faster. From the language model,

P(td | M_pf) = ∏_{w ∈ td} p(w | M_pf)

and p(w | M_pf), like P(w | M_td), is obtained by weighting the word frequency in the user's own posts against the word frequency in the whole collection.
Algorithm 3-3
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × P(td | AllTd)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
Note: P(td | AllTd) is the probability that the user's existing knowledge generates the current answer.
In addition, the quality of the current answer can also be judged from the question it answers: a language model is built from the question, and the probability of generating the answer from the question is computed as a measure of answer quality. The idea is that if words from the question appear in the answer, the user can be considered to be answering on topic rather than carelessly. Another approach is to view the question as generated by the answer, which likewise judges whether the user is answering on topic. Language models emphasize analysis of content similarity, so if a correct answer shares no vocabulary with the question, this way of assigning weights is unsatisfactory.
Length standardization
Users' answering habits differ: some write long answers, some short. A length compensation is therefore added to the probabilities computed by the language model, because the language model involves no lateral comparison between users answering the same question — only the computation of the probability of generating the new query. In a probabilistic model, the longer the post, the smaller the probability of generating any single word; consequently, the probability of generating a long text is much smaller than that of generating a short one.
For example, let document A contain XYY and document B contain XY. For the query q = XY, the probability that document B generates q is

1/2 × 1/2 = 1/4

while the probability that document A generates q is

1/3 × 2/3 = 2/9 < 1/4

yet intuitively the content of A is more relevant than that of B.
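The arithmetic of this example can be checked directly; exact fractions make the length penalty explicit:

```python
from collections import Counter
from fractions import Fraction

def mle_query_prob(query, doc):
    """Probability that doc's MLE unigram model generates the query, as an exact fraction."""
    counts = Counter(doc)
    p = Fraction(1)
    for t in query:
        p *= Fraction(counts[t], len(doc))
    return p

p_a = mle_query_prob("XY", "XYY")  # document A: (1/3) * (2/3) = 2/9
p_b = mle_query_prob("XY", "XY")   # document B: (1/2) * (1/2) = 1/4
print(p_a, p_b, p_a < p_b)         # 2/9 1/4 True: the longer document scores lower
```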
To reduce the score differences caused by answers of different lengths, the total length of all posts answering the same question is computed, and each post is corrected according to the share it occupies of that total length. In this patent, the length-correction formula appears as an image in the original filing.
Algorithm 3-4
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × UserPortion(td)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
The user relationship model
The language model and its weight improvements mainly target the individual user, but in a community a network forms unintentionally among users, and connections between them are hidden within it. If these connections are mined, they may play a positive role in judging users' expertise. For example, users who answer the same question should have partially overlapping knowledge systems; if user A and user B often appear together answering questions, and one of them is judged to be an expert, the other is probably an expert too.
Applicability of PageRank
The PageRank algorithm is used by Google to express the rank of a web page. Its core idea is that if a page is linked to by many other pages, it is generally recognized and trusted, and its rank is therefore high. Unlike earlier research that considered only a single website within a limited scope, this algorithm performs a systematic computation over the network viewed as a graph.
Likewise, in ranking users' expertise, the idea of PageRank can remedy the defect of computing expertise from content alone. If the query words rarely appear in a user's posts, even high-quality answers cannot obtain a high probability, because they do not match on content. Such an expert user can make up his own low score through the high scores of other users who answered the same questions. We therefore mine the number of questions shared between users to define a sharing degree for each user. The expectation is that the more questions two users share, the higher both users' PageRank values become, allowing the lower-scoring one to revise his own expertise score. This exactly matches the intent of the PageRank algorithm.
Obtaining the sharing degree with the PageRank idea
Each user is treated as a node; if two users appear in the answers to the same question, links pointing to each other are added between them. A co-occurrence graph is built in this way, with the number of questions answered together as the weight of the edge connecting two users. Once the user relationship graph is obtained, each user's sharing degree is computed by iteration.
Assume all users start with the same initial PR value; each user distributes this initial value along the edges pointing to other users in proportion to their weights. After each traversal, each user sums the PR he received from others as his new PR value. The PR values converge after roughly 30 iterations. It has been proven theoretically that, regardless of how the initial values are chosen, the algorithm guarantees that every user's PR value converges to its true value.
Algorithm 3-5
For each user u
    Compute the total weight sumOfW of all edges leaving u
    For each user u′
        Distribute PR(u) · w(u, u′) / sumOfW to u′
        Add it to u′'s new PR value
Iterate the above until the PR values stabilize
Note: w(u, u′) is the number of posts shared between users u and u′
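The sharing-degree iteration of Algorithm 3-5 can be sketched as below. The graph is invented for illustration; edge weights stand for the number of questions two users answered together, and the fixed 30 iterations follow the convergence observation above (the patent describes no damping factor, so none is used):

```python
def sharing_degree(weights, n_iter=30):
    """PageRank-style iteration: each user splits its PR value among neighbours
    in proportion to edge weight; weights[u][v] = questions u and v both answered."""
    users = list(weights)
    pr = {u: 1.0 / len(users) for u in users}
    for _ in range(n_iter):
        new = {u: 0.0 for u in users}
        for u in users:
            sum_of_w = sum(weights[u].values())
            if sum_of_w == 0:
                continue  # isolated user: nothing to distribute
            for v, w in weights[u].items():
                new[v] += pr[u] * w / sum_of_w
        pr = new
    return pr

# a and b co-answered 2 questions; each co-answered 1 question with c.
graph = {"a": {"b": 2, "c": 1}, "b": {"a": 2, "c": 1}, "c": {"a": 1, "b": 1}}
pr = sharing_degree(graph)
print(pr)  # a and b end with equal sharing degree, higher than c's
```

For an undirected weighted graph like this, the iteration converges toward scores proportional to each node's total edge weight, which matches the intuition that heavy co-answerers rank higher.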
How the PageRank value is added to the ranking
After the users' expertise ranking has been computed from the language model, it is corrected using the users' PageRank values. A high PageRank value indicates that a user often answers questions together with other users. When using the PageRank value for correction, it cannot simply be used as the weight, because the number of questions each user shares with this user differs.
Before a user's PageRank value is used, it is weighted by multiplying it by the number of shared questions. That is, if user A shares only a few questions with user B but hundreds of questions with user C, then in revising his own expertise score A receives more of C's expertise value; because A shares few questions with B, A obtains only a small amount of B's expertise value. Even if B's sharing degree is very high, A does not benefit, since B's high sharing degree was not caused by A.
In the experiments, the correction value a user obtains from the other users via the PR values is

Δ(u) = Σ_{u′ ≠ u} oldscore(u′) · w(u, u′) · PR(u′) / Σ_{u′ ≠ u} w(u, u′) · PR(u′)

The user's updated expertise score is then

newscore(u) = δ · oldscore(u) + (1 − δ) · Δ(u)
Algorithm 3-6
For each user u
    Traverse each user u′
        The weight of the expertise u can take from u′ is the product of PR(u′) and w(u, u′)
        Distribute expertise to u according to Δ(u) computed with these weights
    Combine the distributed expertise with u's old expertise score in a fixed ratio to obtain the new expertise score
Note: w(u, u′) is the number of posts shared between users u and u′
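The score adjustment of Algorithm 3-6 — equivalently, the Δ(u) and newscore(u) formulas — can be sketched as follows. The user names, scores, and δ = 0.7 are illustrative assumptions:

```python
def adjusted_score(u, oldscore, w, pr, delta=0.7):
    """newscore(u) = delta*oldscore(u) + (1-delta)*Delta(u), where Delta(u) averages the
    other users' old scores weighted by shared-question count w(u,u') and PR(u')."""
    others = [v for v in oldscore if v != u]
    den = sum(w[u].get(v, 0) * pr[v] for v in others)
    if den == 0:
        return oldscore[u]  # no shared questions: nothing to borrow from
    num = sum(oldscore[v] * w[u].get(v, 0) * pr[v] for v in others)
    return delta * oldscore[u] + (1 - delta) * (num / den)

oldscore = {"a": 0.2, "b": 0.8}
w = {"a": {"b": 3}, "b": {"a": 3}}
pr = {"a": 0.5, "b": 0.5}
# a borrows from b: 0.7 * 0.2 + 0.3 * 0.8 = 0.38
print(adjusted_score("a", oldscore, w, pr))
```

The low-scoring user a is pulled up toward the high-scoring co-answerer b, while b is correspondingly pulled down toward a, which is exactly the mutual-influence behavior step S4 describes.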
Experimental analysis
Selection and processing of experimental data
The data come from the Computer & Internet subcategory on Answers. 100 users and the more than 10,000 posts they published were chosen. Digits and symbols were removed from post content, case was ignored, and links contained in posts received no special treatment, being handled as ordinary text.
Such processing may lose some information. Case can carry specific meaning: for example, MAC refers to the physical address burned into a network card, while Mac usually refers to the notebook computers made by Apple. But case may also exist merely to catch the reader's attention: HELP and THANKS do not differ from help and thanks. Because digits, unlike words, rarely carry specific meaning when they occur in text and have no strong connection to the ordinary content of an article, they are excluded from the probability computation.
After an expert (user) ranking is computed for a query, it must be compared with the actual list of experts. Answers provides no detailed expert list, so each user had to be assessed manually to determine whether he could serve as an expert for the query. Assessing whether a user is an expert requires browsing all the questions he has answered. Six new questions were set in total, and judgments were made on the 100 candidate experts (users). The evaluation results are divided into two levels, relevant and irrelevant, with no finer gradation.
Experimental result and analysis
The introduction of the parameter of assessed for performance
MAP: the MAP of a single query is the mean of the precision values measured at the rank of each relevant expert retrieved. The global MAP is the mean of the MAP values of the 6 queries.
MRR: the mean of the reciprocal of the rank at which the first relevant expert appears in the ordering. It indicates how far one must search along the returned user ranking to find a relevant expert.
R-Precision: the precision measured at the rank position equal to the total number of relevant experts in the retrieval results.
Pn: the proportion of correct relevant experts among the top n ranked results.
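A minimal sketch of these four metrics (not from the patent; the function names and the binary-relevance representation are illustrative assumptions):

```python
def average_precision(ranking, relevant):
    """MAP component for one query: mean of the precision values
    measured at the rank of each relevant expert retrieved."""
    hits, total = 0, 0.0
    for i, user in enumerate(ranking, start=1):
        if user in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """MRR component: 1 / rank of the first relevant expert (0 if none)."""
    for i, user in enumerate(ranking, start=1):
        if user in relevant:
            return 1.0 / i
    return 0.0

def precision_at(ranking, relevant, n):
    """P@n: fraction of the top-n ranked users that are relevant experts."""
    return sum(1 for u in ranking[:n] if u in relevant) / n

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant experts."""
    return precision_at(ranking, relevant, len(relevant))
```

For a single query, `ranking` is the returned user list and `relevant` the manually judged expert set; the global MAP and MRR are then the means of these values over the 6 queries.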
Parameter choices in the experiments
Choosing the language model parameter
The language model is built from two parts: one is the contribution of the single post, the other is the contribution of the whole post collection; the latter plays a smoothing role. Let β denote the weight of the single post's contribution. The following experiments were run on the value of β:
For the model in which an answer is generated from the question it answers:
[Table shown as image in original: evaluation metrics for the tested values of β]
It can be seen that β performs best at 0.5; the change in MRR is especially evident, with an improvement of more than 22%. On the other metrics, the differences caused by the three values are not obvious.
For the model built from the user's history of posts:
[Table shown as image in original: evaluation metrics for the tested values of β]
In this model the behavior of β is quite unlike that in the previous model: the model performs best at β = 0.9, but overall the differences among the three values are not obvious. To balance the two models above, and further models, against the performance differences brought by the choice of β, the default value of β is set to the middle-of-the-road 0.7.
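The smoothed per-post query probability being tuned here, p(q|M_td) = ∏_{t∈q} (β·p(t|M_td) + (1−β)·p(t|M_c)), can be sketched as follows (an illustrative implementation, with the names and the zero-probability guard as assumptions; log space avoids underflow on longer queries):

```python
import math
from collections import Counter

def query_log_prob(query_terms, post_terms, all_terms, beta=0.7):
    """log p(q|M_td): each query word mixes the post's own word
    distribution (weight beta) with the whole-collection distribution
    (weight 1 - beta), which smooths away zero probabilities for
    words absent from the single post."""
    post = Counter(post_terms)
    coll = Counter(all_terms)
    logp = 0.0
    for t in query_terms:
        p_post = post[t] / len(post_terms)
        p_coll = coll[t] / len(all_terms)
        p = beta * p_post + (1 - beta) * p_coll
        if p == 0.0:  # query word unseen even in the whole collection
            return float("-inf")
        logp += math.log(p)
    return logp
```

Ranking users by this score over their posts reproduces the query-likelihood ordering the experiments evaluate.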
Choosing the co-occurrence weight
To adjust users' expert degrees through co-occurrence on the basis of the original user scores, a weight must be assigned that determines to what extent the users who answered the same questions influence a given user's expert degree. The parameter δ denotes the proportion of the user's own score in the new score; the extreme case δ = 0 means that the user's expert degree is determined entirely by the other users who answered the same questions. Both models were tested, with β = 0.7.
a) Model in which answers are generated from questions, plus rerank
[Table shown as image in original: rerank metrics for the tested values of δ]
b) Model built from the user's history of posts, plus rerank
[Table shown as image in original: rerank metrics for the tested values of δ]
Unlike the previous model's data, the highest MRR appears at δ = 0.4, and at δ = 0.4 three of the measured metrics reach their highest values. To reconcile the differences among models, the most stable value, δ = 0.8, was chosen as the default.
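The δ blending described above can be sketched as follows (illustrative; the exact aggregation of the co-answerers' scores, here a simple average, is an assumption not fixed by the text):

```python
def rerank_score(own_score, co_answerer_scores, delta=0.8):
    """Blend a user's own expert score with the scores of users who
    answered the same questions. delta is the weight of the user's own
    score; delta = 0 means the new score comes entirely from the
    co-answerers, delta = 1 leaves the original score unchanged."""
    if not co_answerer_scores:
        return own_score
    neighbor_avg = sum(co_answerer_scores) / len(co_answerer_scores)
    return delta * own_score + (1 - delta) * neighbor_avg
```

With the default δ = 0.8, a user's own score dominates and the co-answerers contribute a 20% correction.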
Performance comparison
Varying the contribution weights of posts
The two baselines are:
Baseline 1: ranking by the number of answer posts.
Baseline 2: computing with equal contribution weights for all posts.
The experimental methods compared are the following:
Method 1: the probability that an answer is generated by the question it answers.
Method 2: the probability that a post is generated by the user's historical post model.
Method 3: the similarity between a post and the best answer of the question it answers.
Method 4: the ratio of a post's length to the length of all the replies.
The experimental results are listed below:
[Table shown as image in original: metric comparison of the baselines and Methods 1 to 4]
It can be seen that Baseline 2 is the weakest: assigning equal weight to every post is unfavorable for capturing a user's expert degree. Because it does not emphasize any of the posts the user published, every post contributes identically, and the user's knowledge focus is not made prominent within the user's overall knowledge. Baseline 1 is slightly lower in the overall results, but its performance shows no huge difference from the other methods. The main reason is the data source: Baseline 1 ranks by post count, and the current forum data were all drawn from sub-forums under the same category (Computer & Internet). Many topics concerning computers and networking are closely related, and similar knowledge tends to be mastered together, so the people who answer the most questions under this forum really do tend to be judged experts. If the data source were changed, for example combining sub-forums of the Computer and History forums and again ranking expert degree by post count, the performance would inevitably be very low, because those who post the most under History and those who post the most under Computer do not understand each other's knowledge domains well.
Baseline 1 is the only ranking not based on a language model. From the comparison of Baseline 2 with Method 1 and Method 2 it can be seen that varying the weights among posts is very important: P10 is 0.6167 for Method 1 and Method 2, while for Baseline 2 it is only 0.3333. This shows that judging post quality from content is feasible. Whether judged against the question or against the user's history, both still use a language model to produce probabilities: each of the user's posts is treated as one specific knowledge domain of the user, and the user's overall expertise scope is determined according to the different weightings.
Method 3's outstanding showing is its MRR of nearly 80%, which means that the first correct expert can be found at the first or second position along the returned ranking. Method 3 performs so well on MRR because comparing similarity with the best answer is currently a superior way of judging answer quality. The best answer is selected manually by the asker, a step that is an advantage none of the other methods have. The other methods judge a post's quality computationally from its various attributes, and the attributes of high-quality posts may well not fit the modeling assumptions, for example length or keyword frequency. Therefore, a direct comparison with the best answer, being close to a human's manual judgment of the post's quality, is the most effective and most reasonable way to judge post quality.
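Method 3's best-answer comparison is, per claim 2, a cosine similarity between term vectors; a minimal sketch under the assumption of raw term-frequency vectors (the weighting scheme is not specified in the text):

```python
import math
from collections import Counter

def cosine_sim(post_terms, best_answer_terms):
    """Cosine similarity between the term-frequency vectors of a post
    and the question's best answer; used as the post's quality weight."""
    a, b = Counter(post_terms), Counter(best_answer_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A larger cosine means a smaller angle between the two vectors and hence a more similar, presumably higher-quality, answer.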
Method 4 is the one among the three modeling methods involving no manual judgment whose MRR and P5 are higher, but its other metrics are all unremarkable. This method hopes to compensate for the reduction in probability that long posts suffer. Its performance can be understood as follows: the method does increase the relevance of long posts, but it also weakens the probability of short posts quite severely, so it finds experts well in the top few positions, improving MRR and P5; however, adding too much weight for long posts pushes back the ranks of experts who have only short posts, so the other overall metrics are not prominent.
Performance comparison of length normalization
Because Method 1 and Method 2 employ a language model when computing post weights, I also tried adding Method 4's weighting into these two methods, to see whether this could better adapt to the language model's sensitivity to length changes.
Method 4 hopes to compensate, via length, for the negative effect long posts suffer in probability. Run in isolation, Method 4 performs acceptably: not outstanding, but not very poor either. Applied on top of Method 1 and Method 2, however, the effect is bad; most metrics decline.
The likely reason is that length compensation applied directly as a weight on the language-model posts can adequately compensate for the too-small probabilities of long posts, but once a further language model is layered on top, applying length compensation directly again muddles things. Because the language model used for weighting processes the posts again, post length is no longer a simple linear factor; it becomes complicated, and can no longer be compensated by a crude length normalization that is directly proportional to length.
Performance comparison of reranking
[Tables shown as images in original: metric comparison before and after reranking]
A weighting adjustment was made according to PageRank values computed over the co-occurrence graph between users. After the adjustment, it can be seen that some methods do achieve improved performance, especially on MRR: Method 1's MRR rises from 0.58 to 0.68 after reranking, and Method 2's rises from 0.58 to 0.77. The weights of these two methods are produced by probabilistic modeling, in which whether any single word occurs, and its frequency, can strongly influence the probability; after incorporating the co-occurrence graph, the answers of other users serve as a stable base, reducing the fluctuation caused by words that appear by chance.
The optimization has little effect on Method 3 and Method 4. The explanation is that Method 3 adjusts post contributions by similarity to the best answer, its estimate is already accurate, and the corrective effect of the co-occurrence graph is small. In Method 4, the weight adjustment for long posts is excessive, so experts with long posts and those with short posts are severely polarized in the ranking, users with long posts near the front and those without them near the back; after incorporating the co-occurrence-graph weights, the rankings tend to converge toward the middle. Therefore P5 drops while P10 and P15 rise slightly, showing that experts tend to cluster in the middle part of the list.
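The PageRank computation behind the rerank can be sketched in plain form (illustrative only; the patent's additional weighting by shared-question counts, claim 7, is omitted here):

```python
def pagerank(adj, d=0.85, iters=50):
    """Plain PageRank over a user co-occurrence graph given as
    {user: [users who answered the same questions]}.
    Returns a dict mapping each user to a score summing to 1."""
    nodes = list(adj)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                # dangling node: spread its mass evenly over all nodes
                for v in nodes:
                    nxt[v] += d * pr[u] / n
            else:
                share = d * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
        pr = nxt
    return pr
```

The resulting scores are then blended with each user's original expert degree, stabilizing rankings driven by chance word occurrences.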
Performance differences between queries
The evaluations above all discuss the average level of the 6 queries. The differences among individual queries are considerable; the discussion below uses Method 2 + Rerank, which has the best overall performance.
Fig. 2 shows that query3 is very low on all metrics among these 6 queries; its MRR only reaches 0.1. query1, query2 and query6 basically achieve performance of 70% and above, while all of query3's results hover between 0.1 and 0.2 throughout. query4 and query5 are at a middle level of performance.
Differences among the models have little effect on the relative standing of these 6 queries. For Method 3, which assigns weights by similarity to the best answer without reranking, the metrics of the 6 queries are as shown in Fig. 3:
Similar to Fig. 2, it can be seen that query3 is consistently the worst of the 6, while query1, 2 and 6 perform better. Evidently the difference among models is not the main cause of this result; the greatest difference between the queries is their content.
The contents of the 6 queries are as follows:
[Table shown as image in original: the text of the 6 queries]
There are two main reasons why query1, 2 and 6 perform well:
The content matches the words appearing in the question. Take query2's "buy" and "macbook": although query2 is very short, everything being asked appears in the query, and these few words express clearly and specifically what is wanted.
The keywords that appear coincide frequently with words in the overall post collection. For example, query6's "surf" and "internet" are words that indicate the direction of the question very clearly; because they occur with high frequency in the overall post collection, they raise the matching users' expert degrees.
The poorly performing queries show the following problems:
The question lacks specificity. In query4, for example, the content is insufficient and contains no special keywords that could express the main need. "slow" and "speed" easily produce ambiguity across the whole Computer & Internet forum: judging from the keywords alone, the question might be asking about computer software and hardware under the Computer subcategory, or about network congestion under the Internet subcategory. It is therefore difficult to give accurate user expert degrees. As another example, the word "clean" in query3 also appears in query5, yet one refers to cleaning the machine's appearance and the other to cleaning out viruses and junk files.
Even words that express the question's meaning accurately may be low-frequency words in the whole document collection.
query3's content is among the longer of the 6 queries, and its main keywords, "dirty" and "clean", are specific. The reason its performance is low under every model may be that the word "dirty" occurs very rarely: the whole corpus contains 894,808 words in total, each word occurring 50 times on average, while "dirty" occurs only 15 times. See the following table for details:
[Table shown as image in original: frequency statistics of the query keywords]
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art may make improvements and substitutions without departing from the technical principle of the present invention, and such improvements and substitutions should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. An expert recommendation method based on a language model, characterized by comprising the steps of:
S1: collecting users' published content to characterize the knowledge features of the corresponding users, using a language model from text retrieval to perform probability-based text modeling of the users' knowledge features, the built model containing an index of each user's expert-degree data;
each post published by a user is modeled as a dictionary sample from which the post itself is generated, so that the probability that the post generates a query can be computed;
the language model is built in two respects: first, a separate language model M_td is built for each post of the user; second, all posts of all users are regarded as one text and a language model M_c is built over it; the probability of occurrence of the whole query is
p(q|M_td) = ∏_{t∈q} (β · p(t|M_td) + (1 − β) · p(t|M_c));
wherein q is the query text; β is a parameter used to balance the weights of the whole post collection and the single post; t ∈ q ranges over the words in the query text; td is a single post published by the user; c is all the posts of all users, i.e., the whole post collection; p(t|M_td) is the probability that the user's post generates the query word t; p(t|M_c) is the probability that all posts of all users generate the query word t; and p(q|M_td) is the probability that the user's post generates the query text q;
S2: building a user relationship model between users, the user relationship model containing a user relationship graph through which users' expert degrees can influence one another;
the user relationship model is built on Google's PageRank algorithm;
S3: when a query is given, computing original expert degrees according to each user's expert-degree index information and producing an original ranked user list;
S4: adjusting each user's expert degree according to the connections in the user relationship graph to obtain the final ranked user list.
2. The method of claim 1, characterized in that a vector model is used to compute the similarity between a post and the best answer; high-dimensional vector spaces are built respectively for the post whose quality is to be assessed and for the best answer, and the angle between the two vectors is computed; the larger the cosine of the angle, the smaller the angle between the two vectors and the more similar the two are considered; the vector similarity of the post and the best answer is
Sim(td, BestAns) = cos(v_td, v_BestAns) = (v_td · v_BestAns) / (|v_td| · |v_BestAns|);
wherein BestAns is the best answer, v_td and v_BestAns are the vectors of the post and the best answer, and Sim(td, BestAns) is the vector similarity between the single post published by the user and the best answer.
3. The method of claim 1, characterized in that the collection of all of a user's answer posts is regarded as one large text library; a language model is built over this text library, the probability that this language model generates the current answer is computed, and the quality of the answer is judged according to whether the answer content derives from the user's knowledge feature architecture.
4. The method of claim 1, characterized in that the formula used for length correction is
[Length-correction formula shown as image in original]
5. The method of claim 1, characterized in that in step S1 the users' knowledge features are stored in text inverted-index form, supporting continual updating and subsequent cleanup operations.
6. The method of claim 1, characterized in that the parameter estimation of the model built in step S1 can use point estimation, moment estimation, or maximum-likelihood estimation.
7. The method of claim 1, characterized in that after the users' expert-degree ranking is computed from the built language model, it is corrected by the users' PageRank values; before a user's PageRank value is used, it is multiplied, as a weighting, by the number of questions shared with each co-answering user.
CN 201110373475 2011-11-22 2011-11-22 Expert recommendation method based on language model Expired - Fee Related CN102495860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110373475 CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110373475 CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Publications (2)

Publication Number Publication Date
CN102495860A CN102495860A (en) 2012-06-13
CN102495860B true CN102495860B (en) 2013-10-02

Family

ID=46187685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110373475 Expired - Fee Related CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Country Status (1)

Country Link
CN (1) CN102495860B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10984387B2 (en) 2011-06-28 2021-04-20 Microsoft Technology Licensing, Llc Automatic task extraction and calendar entry
CN102819605A (en) * 2012-08-17 2012-12-12 东方钢铁电子商务有限公司 Adaptability matching method
CN102880640B (en) * 2012-08-20 2015-04-01 浙江大学 Network modeling-based service recommending method
CN103810169B (en) * 2012-11-06 2018-01-09 腾讯科技(深圳)有限公司 A kind of method and apparatus for excavating community domain expert
CN103631859B (en) * 2013-10-24 2017-01-11 杭州电子科技大学 Intelligent review expert recommending method for science and technology projects
CN105335447A (en) * 2014-08-14 2016-02-17 北京奇虎科技有限公司 Computer network-based expert question-answering system and construction method thereof
CN104636456B (en) * 2015-02-03 2018-01-23 大连理工大学 The problem of one kind is based on term vector method for routing
US10361981B2 (en) 2015-05-15 2019-07-23 Microsoft Technology Licensing, Llc Automatic extraction of commitments and requests from communications and content
US10795878B2 (en) 2015-10-23 2020-10-06 International Business Machines Corporation System and method for identifying answer key problems in a natural language question and answering system
CN107291815A (en) * 2017-05-22 2017-10-24 四川大学 Recommend method in Ask-Answer Community based on cross-platform tag fusion
CN110020096B (en) * 2017-07-24 2021-09-07 北京国双科技有限公司 Query-based classifier training method and device
US11144602B2 (en) 2017-08-31 2021-10-12 International Business Machines Corporation Exploiting answer key modification history for training a question and answering system
CN107766545A (en) * 2017-10-31 2018-03-06 浪潮软件集团有限公司 Scientific and technological data management method and device
CN108153816A (en) * 2017-11-29 2018-06-12 浙江大学 A kind of method for learning to solve community's question-answering task using asymmetrical multi-panel sorting network
CN108287875B (en) * 2017-12-29 2021-10-26 东软集团股份有限公司 Character co-occurrence relation determining method, expert recommending method, device and equipment
CN109635183B (en) * 2018-11-01 2021-09-21 南京航空航天大学 Community-based partner recommendation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101321190A (en) * 2008-07-04 2008-12-10 清华大学 Recommend method and recommend system of heterogeneous network
CN101751454A (en) * 2009-12-12 2010-06-23 浙江大学 Selection method of network answers based on probabilistic latent semantic analysis
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101321190A (en) * 2008-07-04 2008-12-10 清华大学 Recommend method and recommend system of heterogeneous network
CN101751454A (en) * 2009-12-12 2010-06-23 浙江大学 Selection method of network answers based on probabilistic latent semantic analysis
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《A language modeling framework for expert finding》; Krisztian Balog et al.; 《Information Processing and Management》; 20091231; pp. 1-19 *
Krisztian Balog et al. 《A language modeling framework for expert finding》. 《Information Processing and Management》. 2009, pp. 1-19.

Also Published As

Publication number Publication date
CN102495860A (en) 2012-06-13

Similar Documents

Publication Publication Date Title
CN102495860B (en) Expert recommendation method based on language model
US9324112B2 (en) Ranking authors in social media systems
CN103678564B (en) Internet product research system based on data mining
Elmeleegy et al. Mashup advisor: A recommendation tool for mashup development
CN106504011A (en) A kind of methods of exhibiting of business object and device
US20090265290A1 (en) Optimizing ranking functions using click data
CN103020851B (en) A kind of metric calculation method supporting comment on commodity data multidimensional to analyze
US20110196860A1 (en) Method and apparatus for rating user generated content in search results
CN103064945A (en) Situation searching method based on body
CN104834686A (en) Video recommendation method based on hybrid semantic matrix
Gkotsis et al. It's all in the content: state of the art best answer prediction based on discretisation of shallow linguistic features
CN106294863A (en) A kind of abstract method for mass text fast understanding
Sung et al. Booming up the long tails: Discovering potentially contributive users in community-based question answering services
Liang et al. Personalized recommender system based on item taxonomy and folksonomy
Brocken et al. Bing-CF-IDF+: A semantics-driven news recommender system
Malo et al. Concept‐based document classification using Wikipedia and value function
Ravanifard et al. Content-aware listwise collaborative filtering
Lops et al. Improving social filtering techniques through wordnet-based user profiles
Lops et al. A semantic content-based recommender system integrating folksonomies for personalized access
Shen et al. A new user similarity measure for collaborative filtering algorithm
CN107016135A (en) It is a kind of towards non-determined, infidelity, onlap the positive and negative two-way dynamic equilibrium search strategy of miscellaneous resource environment
Jian et al. Multi-task gnn for substitute identification
Batra et al. Content based hidden web ranking algorithm (CHWRA)
Tang et al. Domain problem‐solving expert identification in community question answering
CN102033961A (en) Open-type knowledge sharing platform and polysemous word showing method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131002

Termination date: 20191122