CN102495860B - Expert recommendation method based on language model - Google Patents

Expert recommendation method based on a language model

Info

Publication number
CN102495860B
CN102495860B (application CN201110373475A)
Authority
CN
China
Prior art keywords
user
model
expert
probability
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201110373475
Other languages
Chinese (zh)
Other versions
CN102495860A (en)
Inventor
崔斌
姚俊杰
阴红志
刘晴芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University filed Critical Peking University
Priority to CN 201110373475 priority Critical patent/CN102495860B/en
Publication of CN102495860A publication Critical patent/CN102495860A/en
Application granted granted Critical
Publication of CN102495860B publication Critical patent/CN102495860B/en


Abstract

The invention discloses an expert recommendation method based on a language model. The method comprises the following steps: S1, collecting the content published by users to characterize each user's knowledge features, building a probability-based text model of those features using a language model from text retrieval, and maintaining an expertise-score data index for each user in the built model; S2, building a user relationship model among users, containing a user relationship graph through which users' expertise scores influence one another; S3, when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users; and S4, adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.

Description

Expert recommendation method based on a language model
Technical field
The present invention relates to the field of Internet technology, and in particular to an expert recommendation method based on a language model.
Background technology
Expert recommendation refers to the process of modeling the authority and specialty fields of a pool of candidate users and, given a query request, recommending the expert users that match it. In today's massive accumulation of information, a great deal of useful but as-yet-unmined information exists, and the urgent problem of information "dead corners" — information that never circulates because it never meets the right opportunity — remains. The purpose of expert recommendation is to turn the one-way direction of search into two-way exchange: by providing an effective information push mechanism, it accelerates the circulation of information on the network. As an important branch of user management, earlier expert recommendation methods were either constrained by the application scenarios and models of traditional fields, or adopted only simple user modeling and ranking methods. They lack flexibility and generality and cannot adapt to expert identification in the new Internet environment.
The rise of interactive online communities has made people increasingly accustomed to providing and obtaining information there. Hidden among the large numbers of questions and answers are experts who are comparatively proficient in specific topics; the knowledge they possess lets them answer users' questions correctly, so the answers they provide are more valuable. Compared with passively waiting after a question is posted, actively seeking out experts and pushing the question to users who can answer it improves both the accuracy and the speed of answers, greatly increases the timeliness of information exchange in Q&A communities, and better matches the growth momentum of Q&A-style communities.
However, experts in a community are largely unlabeled. In travel forums such as Tripadvisor, administrators judge large numbers of posts to identify experts; although the experts found by such manual recognition have a very high success rate, the process is very inefficient. Domestic sites such as Baidu Zhidao and Yahoo Knowledge rank users by points, but points depend more on the quantity of answers and the user's activity level, which cannot serve as the sole basis for judging expertise. In particular, users who post low-quality answers may gain a large advantage through sheer quantity and activity, possibly masking the performance of genuine experts.
Researchers therefore hope to perform semantic recognition on the posts a user publishes and analyze their content in order to determine the quality of the user's answers, and thereby judge more accurately whether the user is an expert in a given field.
There are many methods for judging whether a query and a document are relevant, for example the Boolean model and the vector space model. The degree of relevance between a query demand and a document can also be given as a probability. Intuitively, if the words of a query occur in a document, a user may consider the document relevant to the query; what a probabilistic model computes is the probability that the query demand and the document are relevant, i.e. P(R=1 | q, d), where q is the query demand and d is the document.
A traditional probabilistic model models the relevance of the whole query to the document. A language model instead models the document itself and computes the probability that the document's language model generates the query, i.e. P(q | M_d), where q is the query and M_d is the language model built from the document. Under a language model, the document is viewed as a "dictionary" that may generate query demands: each time a query demand is issued, its relevance is determined by computing the generation probability.
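As a toy illustration of this query-likelihood idea (the words and counts below are invented for demonstration and are not from the patent), a maximum-likelihood unigram model of a document can score a query as follows:

```python
from collections import Counter

def query_likelihood(query_terms, doc_terms):
    """P(q | M_d): product over query terms of each term's MLE probability in the document."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    p = 1.0
    for t in query_terms:
        p *= counts[t] / n  # zero whenever t never occurs in the document
    return p

# A document containing "java java cache" generates the query "java cache"
# with probability (2/3) * (1/3) = 2/9.
print(query_likelihood(["java", "cache"], ["java", "java", "cache"]))
```

The zero result for unseen terms is exactly the weakness the smoothing formula later in the patent addresses.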
Summary of the invention
(1) Technical problem to be solved
The objective of the present invention is to propose an expert recommendation method based on a language model that can actively seek out experts, push a question to the users able to answer it, improve the accuracy and speed of answers, and increase the timeliness of information exchange in Q&A communities.
(2) Technical solution
To solve the above technical problem, the invention provides an expert recommendation method based on a language model, comprising the steps of:
S1: collecting the content published by users to characterize each user's knowledge features, and building a probability-based text model of those features using a language model from text retrieval, the built model containing an expertise-score data index for each user;
S2: building a user relationship model between users, the model containing a user relationship graph through which users' expertise scores influence one another;
S3: when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users;
S4: adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.
Preferably, in step S1 each post published by a user is modeled as a sample drawn from a dictionary; the model behind the sample is inferred, and the probability that this model generates the query is then computed.
Preferably, the language model is built on two levels: first, an independent language model M_td is built for each of the user's posts; second, all posts of all users are treated as a single text from which a collection language model M_c is built. The probability that the whole query occurs is

p(q | M_td) = ∏_{t ∈ q} ( β · p(t | M_td) + (1 − β) · p(t | M_c) )

where q is the query text and β is a parameter that balances the weight between the whole post collection and the single post.
Preferably, a vector space model is used to compute the similarity between a post and the best answer: high-dimensional vectors are built for the post under quality assessment and for the best answer, and the angle between the two vectors is computed. The larger the cosine of the angle, the smaller the angle and the more similar the two are considered. The vector similarity between the post and the best answer is

Sim(td, BestAns) = cos(td, BestAns) = (td · BestAns) / (|td| · |BestAns|).
Preferably, the collection of all of a user's answers is treated as one large text corpus from which a language model is built; the probability that this language model generates the current answer is computed, and the quality of the answer is judged by whether its content derives from the user's knowledge-feature structure.
Preferably, a length-correction formula is applied in which each answer is weighted by the share it occupies of the total length of all answers to the same question. [Formula rendered as an image in the original filing.]
Preferably, in step S1 the user's knowledge features are stored as a text inverted index that supports continuous updating and subsequent cleaning operations.
Preferably, the parameters of the model built in step S1 may be estimated by point estimation, the method of moments, or maximum likelihood estimation.
Preferably, step S2 builds the user relationship model on the basis of Google's PageRank algorithm.
Preferably, after the users' expertise ranking has been computed from the built language model, it is corrected using the users' PageRank values; before a user's PageRank value is used, it is weighted by multiplying it by the number of questions shared with that user.
(3) Beneficial effects
By building a language model and a user relationship model, the present invention extracts users' expertise scores from the data and then corrects the expertise ranking using the relationship graph between users extracted from the data. Compared with passively waiting after a question is posted, it actively seeks out experts and pushes the question to the users able to answer it, improving the accuracy and speed of answers, greatly increasing the timeliness of information exchange in Q&A communities, and better matching the growth momentum of Q&A-style communities.
Description of the drawings
Fig. 1 is a flow chart of the structure of the method of the invention;
Fig. 2 is a comparison chart of assessment indicators in an embodiment of the invention; the indicators from left to right correspond to query1-6;
Fig. 3 is another comparison chart of assessment indicators in an embodiment of the invention; the indicators from left to right correspond to query1-6.
Embodiment
The specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention but do not limit its scope.
The structural flow chart of the expert recommendation method based on a language model of the present invention is shown in Fig. 1, where dashed boxes indicate algorithm models and solid boxes indicate data. The flow runs from top to bottom and from left to right. The upper part is static and can be built in advance; the lower part is dynamic and requires instant computation on user input. The present invention comprises the steps of: S1: collecting the content published by users to characterize each user's knowledge features, and building a probability-based text model of those features using a language model from text retrieval, the built model containing an expertise-score data index for each user; S2: building a user relationship model between users, the model containing a user relationship graph through which users' expertise scores influence one another; S3: when a query is given, computing each user's initial expertise score from the indexed expertise information and producing an initial ranked list of users; S4: adjusting each user's expertise score according to the links in the user relationship graph to obtain the final ranked list of users.
The language model method and its derivation
A language model is a kind of probabilistic model. The problem to be solved in this patent is to produce, given an input question, the list of users most relevant to it. That is, p(u | q) is obtained from the query text and the user's existing posting history, where u is the user and q is the query text.
According to Bayes' theorem,

p(u | q) = p(q | u) · p(u) / p(q)
When comparing users' probabilities for the same query, the value of p(q) is constant and can be omitted. Assuming here that p(u) is uniformly distributed, it suffices to rank users by p(q | u). The collection of all answers posted by user u serves as the modeling basis for u. Therefore
p(q | u) = p(q | M_pf) = f( p(q | M_td1), p(q | M_td2), ..., p(q | M_tdn) )

where pf is the set of all posts of user u and tdi ∈ pf for i = 1, 2, ..., n.
The model of the present invention treats each post published by a user as a sample drawn from a dictionary, infers the model behind the sample, and then computes the probability that this model generates the query. During modeling, the generation probability of each word in the text must be computed. Parameter estimation may use point estimation, the method of moments, maximum likelihood estimation, and so on; in the experiments, maximum likelihood estimation is used.
Maximum likelihood estimation (MLE) assumes that behind a text there is a probability distribution and that the text is one sample of that distribution.
p(t1, t2, ..., tn; θ) = P(T1 = t1, T2 = t2, ..., Tn = tn)

where θ is the unknown parameter.
To maximize p(t1, t2, ..., tn; θ), its partial derivatives are taken, from which the values of the unknown parameters can be computed. It follows that under MLE the probability of a word t occurring in a text d is

p(t | M_d) = c(t, d) / |d|

where c(t, d) is the number of occurrences of t in d and |d| is the total number of words in d.
The language model is built on two levels: first, an independent language model M_td is built for each of the user's posts; second, all posts of all users are treated as one text from which a collection language model M_c is built. The probability that the whole query occurs is

p(q | M_td) = ∏_{t ∈ q} ( β · p(t | M_td) + (1 − β) · p(t | M_c) )

where β is a parameter that balances the weight between the whole post collection and the single post. The benefit of this smoothing is that when a query word does not appear in a single post, its probability is not 0, which would otherwise force the probability of the whole query under that post to 0 as well.
Basic model
Each of a user's posts yields a probability of generating the query. How these probabilities are merged into the probability that the user's posts as a whole generate the query is the question this patent discusses, namely the computation of the function f in

p(q | M_pf) = f( p(q | M_td1), p(q | M_td2), ..., p(q | M_tdn) )
Equal weights
The most basic language model gives each post equal weight. After the probability that each post generates the query is computed, the probabilities are added and averaged to obtain the probability that the user's whole post collection generates the query.
Algorithm 3-1
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post
        Average P(t | td) over all the user's posts (sum divided by the number of posts) to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
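Algorithm 3-1 can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the user/post data, function names, and the default β = 0.7 are assumptions for demonstration; the smoothing follows the β-interpolation formula given earlier.

```python
from collections import Counter

def term_prob(t, post, coll_counts, coll_len, beta=0.7):
    # beta * p(t | M_td) + (1 - beta) * p(t | M_c): interpolate the post's MLE
    # estimate with the whole-collection estimate so unseen terms are not zero.
    p_post = Counter(post)[t] / len(post) if post else 0.0
    p_coll = coll_counts[t] / coll_len
    return beta * p_post + (1 - beta) * p_coll

def rank_users(query, user_posts, beta=0.7):
    """Equal-weight basic model: average P(t|td) over a user's posts per term,
    multiply across query terms, and sort users by the resulting probability."""
    all_terms = [t for posts in user_posts.values() for post in posts for t in post]
    coll_counts, coll_len = Counter(all_terms), len(all_terms)
    scores = {}
    for user, posts in user_posts.items():
        score = 1.0
        for t in query:
            per_post = [term_prob(t, p, coll_counts, coll_len, beta) for p in posts]
            score *= sum(per_post) / len(per_post)
        scores[user] = score
    return sorted(scores, key=scores.get, reverse=True)

posts = {
    "alice": [["java", "java", "thread"], ["python", "script"]],
    "bob": [["cooking", "recipe"]],
}
print(rank_users(["java"], posts))  # alice's posts mention "java", so she ranks first
```

The collection model keeps bob's score above zero for the query "java" even though none of his posts contains the term.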
Benchmark models
A benchmark model is the reference standard for measuring whether a new algorithm is effective. This patent proposes two reference standards. One is the basic language model described above, in which every post has equal weight. The other ignores content entirely and ranks users only by their posting volume. The latter gives the same expert ranking for every query; such a content-independent ranking is clearly inflexible, since many of a user's posts may discuss topics unrelated to the query, so a ranking based only on post count is very inaccurate. On top of the basic language model, this patent makes several improvements in search of more effective algorithms.
Comparison with best-answer similarity
On Answers, if an answer is chosen as the best answer by the asker, the quality of that post is high. This advantage can be incorporated into the quality assessment of posts: the probability computed purely by MLE is revised according to the distance between the user's answer and the best answer, so that posts are no longer treated as equally important and higher-quality posts receive higher importance. The revision takes the form of multiplying by a weight when the probability that the post generates the query demand is computed. Semantically, the higher this weight, the higher the quality of the post and the more likely the user is an expert on questions of this kind. In the user's final score, multiplying by this weight widens the gap between the generation probabilities of different answers to the same question: the greater the similarity to the best answer, the higher the probability that the post can answer the query demand, ultimately raising the user's score.
A vector space model is used to compute the similarity between a post and the best answer: high-dimensional vectors are built for the post under quality assessment and for the best answer, and the angle between the two vectors is computed. The larger the cosine of the angle, the smaller the angle and the more similar the two are considered.

Sim(td, BestAns) = cos(td, BestAns) = (td · BestAns) / (|td| · |BestAns|)
Algorithm 3-2
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × Sim(td, BestAns)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
Note: Sim(td, BestAns) is the vector similarity between the post and the best answer.
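A minimal sketch of the cosine weight used in Algorithm 3-2, built over raw term-count vectors (the patent does not specify the vector construction, e.g. any tf-idf weighting, so plain counts are assumed here):

```python
import math
from collections import Counter

def cosine_sim(post_terms, best_ans_terms):
    """Sim(td, BestAns) = (td . BestAns) / (|td| * |BestAns|) over term-count vectors."""
    a, b = Counter(post_terms), Counter(best_ans_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Identical term distributions give similarity 1 (up to rounding);
# disjoint vocabularies give 0.
print(cosine_sim(["reboot", "router"], ["reboot", "router"]))
print(cosine_sim(["reboot"], ["recipe"]))
```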
Contribution of the user's whole answer history to an answer
To compute the quality of a post, besides comparing it with the best answer, a frame of reference can also be obtained from the collection of all of the user's answers. If a user is proficient in some area of knowledge, the words associated with that knowledge should appear frequently in his answers. That is, if a word appears often across the user's whole post collection and is used in the post answering the current question, it can be concluded that the user is proficient and authoritative in that area and applied that professional knowledge in his answer, so the post should be judged to be of higher quality.
This contribution degree can be viewed as the user's existing knowledge having generated the current post. The collection of all of the user's answers is therefore treated as one large text corpus, a language model is built from it, and the probability that this language model generates the current answer is computed. The quality of the answer is judged by whether its content derives from the user's knowledge system. The contribution degree is finally multiplied into the probability that the post generates the query, so that high-quality answers raise the user's expertise score faster. From the language model,

P(td | M_pf) = ∏_{w ∈ td} p(w | M_pf)

and p(w | M_pf), like P(w | M_td), is obtained by weighting the word frequency in the user's own posts against the word frequency in the whole collection.
Algorithm 3-3
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × P(td | AllTd)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
Note: P(td | AllTd) is the probability that the user's existing knowledge generates the current answer.
In addition, the quality of the current answer can also be judged from the question it answers: a language model is built from the question, and the probability of generating the answer from the question is computed as a measure of answer quality. The idea is that if words from the question appear in the answer, the user can be considered to be answering on topic rather than carelessly. Another approach is to view the question as generated by the answer, which likewise judges whether the user is answering on topic. Language models emphasize analysis of content similarity, so if a correct answer shares no vocabulary with the question, this way of assigning weights is unsatisfactory.
Length standardization
Users' answering habits differ: some write long answers, some short. A length compensation is therefore added to the probabilities computed by the language model, because the language model involves no lateral comparison between users answering the same question — only the computation of the probability of generating the new query. In a probabilistic model, the longer the post, the smaller the probability of generating any single word; consequently, the probability of generating a long text is much smaller than that of generating a short one.
For example, let document A contain XYY and document B contain XY. For the query q = XY, the probability that document B generates q is

1/2 × 1/2 = 1/4

while the probability that document A generates q is

1/3 × 2/3 = 2/9 < 1/4

yet intuitively the content of A is more relevant than that of B.
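The arithmetic of this example can be checked directly; exact fractions make the length penalty explicit:

```python
from collections import Counter
from fractions import Fraction

def mle_query_prob(query, doc):
    """Probability that doc's MLE unigram model generates the query, as an exact fraction."""
    counts = Counter(doc)
    p = Fraction(1)
    for t in query:
        p *= Fraction(counts[t], len(doc))
    return p

p_a = mle_query_prob("XY", "XYY")  # document A: (1/3) * (2/3) = 2/9
p_b = mle_query_prob("XY", "XY")   # document B: (1/2) * (1/2) = 1/4
print(p_a, p_b, p_a < p_b)         # 2/9 1/4 True: the longer document scores lower
```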
To reduce the score differences caused by answers of different lengths, the total length of all posts answering the same question is computed, and each post is corrected according to the share it occupies of that total length. In this patent, the length-correction formula appears as an image in the original filing.
Algorithm 3-4
For each user
    For each term t in the query
        For each post td of the user
            Compute P(t | td) = the probability that this term is generated by this post × UserPortion(td)
        Add the P(t | td) of all the user's posts to obtain the probability that this user generates this term
    Multiply the probabilities the user assigns to each query term to obtain the probability that the user generates this query
Rank users by their probabilities and output the ranking
The user relationship model
The language model and its weight improvements mainly target the individual user, but in a community a network forms unintentionally among users, and connections between them are hidden within it. If these connections are mined, they may play a positive role in judging users' expertise. For example, users who answer the same question should have partially overlapping knowledge systems; if user A and user B often appear together answering questions, and one of them is judged to be an expert, the other is probably an expert too.
Applicability of PageRank
The PageRank algorithm is used by Google to express the rank of a web page. Its core idea is that if a page is linked to by many other pages, it is generally recognized and trusted, and its rank is therefore high. Unlike earlier research that considered only a single website within a limited scope, this algorithm performs a systematic computation over the network viewed as a graph.
Likewise, in ranking users' expertise, the idea of PageRank can remedy the defect of computing expertise from content alone. If the query words rarely appear in a user's posts, even high-quality answers cannot obtain a high probability, because they do not match on content. Such an expert user can make up his own low score through the high scores of other users who answered the same questions. We therefore mine the number of questions shared between users to define a sharing degree for each user. The expectation is that the more questions two users share, the higher both users' PageRank values become, allowing the lower-scoring one to revise his own expertise score. This exactly matches the intent of the PageRank algorithm.
Obtaining the sharing degree with the PageRank idea
Each user is treated as a node; if two users appear in the answers to the same question, links pointing to each other are added between them. A co-occurrence graph is built in this way, with the number of questions answered together as the weight of the edge connecting two users. Once the user relationship graph is obtained, each user's sharing degree is computed by iteration.
Assume all users start with the same initial PR value; each user distributes this initial value along the edges pointing to other users in proportion to their weights. After each traversal, each user sums the PR he received from others as his new PR value. The PR values converge after roughly 30 iterations. It has been proven theoretically that, regardless of how the initial values are chosen, the algorithm guarantees that every user's PR value converges to its true value.
Algorithm 3-5
For each user u
    Compute the total weight sumOfW of all edges leaving u
    For each user u′
        Distribute PR(u) · w(u, u′) / sumOfW to u′
        Add it to u′'s new PR value
Iterate the above until the PR values stabilize
Note: w(u, u′) is the number of posts shared between users u and u′
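The sharing-degree iteration of Algorithm 3-5 can be sketched as below. The graph is invented for illustration; edge weights stand for the number of questions two users answered together, and the fixed 30 iterations follow the convergence observation above (the patent describes no damping factor, so none is used):

```python
def sharing_degree(weights, n_iter=30):
    """PageRank-style iteration: each user splits its PR value among neighbours
    in proportion to edge weight; weights[u][v] = questions u and v both answered."""
    users = list(weights)
    pr = {u: 1.0 / len(users) for u in users}
    for _ in range(n_iter):
        new = {u: 0.0 for u in users}
        for u in users:
            sum_of_w = sum(weights[u].values())
            if sum_of_w == 0:
                continue  # isolated user: nothing to distribute
            for v, w in weights[u].items():
                new[v] += pr[u] * w / sum_of_w
        pr = new
    return pr

# a and b co-answered 2 questions; each co-answered 1 question with c.
graph = {"a": {"b": 2, "c": 1}, "b": {"a": 2, "c": 1}, "c": {"a": 1, "b": 1}}
pr = sharing_degree(graph)
print(pr)  # a and b end with equal sharing degree, higher than c's
```

For an undirected weighted graph like this, the iteration converges toward scores proportional to each node's total edge weight, which matches the intuition that heavy co-answerers rank higher.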
How the PageRank value is added to the ranking
After the users' expertise ranking has been computed from the language model, it is corrected using the users' PageRank values. A high PageRank value indicates that a user often answers questions together with other users. When using the PageRank value for correction, it cannot simply be used as the weight, because the number of questions each user shares with this user differs.
Before a user's PageRank value is used, it is weighted by multiplying it by the number of shared questions. That is, if user A shares only a few questions with user B but hundreds of questions with user C, then in revising his own expertise score A receives more of C's expertise value; because A shares few questions with B, A obtains only a small amount of B's expertise value. Even if B's sharing degree is very high, A does not benefit, since B's high sharing degree was not caused by A.
In the experiments, the correction value a user obtains from the other users via the PR values is

Δ(u) = Σ_{u′ ≠ u} oldscore(u′) · w(u, u′) · PR(u′) / Σ_{u′ ≠ u} w(u, u′) · PR(u′)

The user's updated expertise score is then

newscore(u) = δ · oldscore(u) + (1 − δ) · Δ(u)
Algorithm 3-6
For each user u
    Traverse each user u′
        The weight of the expertise u can take from u′ is the product of PR(u′) and w(u, u′)
        Distribute expertise to u according to Δ(u) computed with these weights
    Combine the distributed expertise with u's old expertise score in a fixed ratio to obtain the new expertise score
Note: w(u, u′) is the number of posts shared between users u and u′
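The score adjustment of Algorithm 3-6 — equivalently, the Δ(u) and newscore(u) formulas — can be sketched as follows. The user names, scores, and δ = 0.7 are illustrative assumptions:

```python
def adjusted_score(u, oldscore, w, pr, delta=0.7):
    """newscore(u) = delta*oldscore(u) + (1-delta)*Delta(u), where Delta(u) averages the
    other users' old scores weighted by shared-question count w(u,u') and PR(u')."""
    others = [v for v in oldscore if v != u]
    den = sum(w[u].get(v, 0) * pr[v] for v in others)
    if den == 0:
        return oldscore[u]  # no shared questions: nothing to borrow from
    num = sum(oldscore[v] * w[u].get(v, 0) * pr[v] for v in others)
    return delta * oldscore[u] + (1 - delta) * (num / den)

oldscore = {"a": 0.2, "b": 0.8}
w = {"a": {"b": 3}, "b": {"a": 3}}
pr = {"a": 0.5, "b": 0.5}
# a borrows from b: 0.7 * 0.2 + 0.3 * 0.8 = 0.38
print(adjusted_score("a", oldscore, w, pr))
```

The low-scoring user a is pulled up toward the high-scoring co-answerer b, while b is correspondingly pulled down toward a, which is exactly the mutual-influence behavior step S4 describes.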
Experimental analysis
Selection and processing of experimental data
The data come from the Computer & Internet subcategory on Answers. 100 users and the more than 10,000 posts they published were chosen. Digits and symbols were removed from post content, case was ignored, and links contained in posts received no special treatment, being handled as ordinary text.
Such processing may lose some information. Case can carry specific meaning: for example, MAC refers to the physical address burned into a network card, while Mac usually refers to the notebook computers made by Apple. But case may also exist merely to catch the reader's attention: HELP and THANKS do not differ from help and thanks. Because digits, unlike words, rarely carry specific meaning when they occur in text and have no strong connection to the ordinary content of an article, they are excluded from the probability computation.
After an expert (user) ranking is computed for a query, it must be compared with the actual list of experts. Answers provides no detailed expert list, so each user had to be assessed manually to determine whether he could serve as an expert for the query. Assessing whether a user is an expert requires browsing all the questions he has answered. Six new questions were set in total, and judgments were made on the 100 candidate experts (users). The evaluation results are divided into two levels, relevant and irrelevant, with no finer gradation.
Experimental result and analysis
The introduction of the parameter of assessed for performance
MAP: the MAP of a single query is the mean of the precision values measured at the rank of each relevant expert retrieved. The global MAP is the mean of the MAP values of the 6 queries.
MRR: the mean of the reciprocal of the rank at which the first relevant expert appears in the ordering. It indicates how far one must search along the returned user ranking to find a relevant expert.
R-Precision: the precision measured at the rank position equal to the total number of relevant experts in the retrieval results.
Pn: the proportion of correct relevant experts among the top n ranked results.
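A minimal sketch of these four metrics (not from the patent; the function names and the binary-relevance representation are illustrative assumptions):

```python
def average_precision(ranking, relevant):
    """MAP component for one query: mean of the precision values
    measured at the rank of each relevant expert retrieved."""
    hits, total = 0, 0.0
    for i, user in enumerate(ranking, start=1):
        if user in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

def reciprocal_rank(ranking, relevant):
    """MRR component: 1 / rank of the first relevant expert (0 if none)."""
    for i, user in enumerate(ranking, start=1):
        if user in relevant:
            return 1.0 / i
    return 0.0

def precision_at(ranking, relevant, n):
    """P@n: fraction of the top-n ranked users that are relevant experts."""
    return sum(1 for u in ranking[:n] if u in relevant) / n

def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant experts."""
    return precision_at(ranking, relevant, len(relevant))
```

For a single query, `ranking` is the returned user list and `relevant` the manually judged expert set; the global MAP and MRR are then the means of these values over the 6 queries.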
Parameter choices in the experiments
Choosing the language model parameter
The language model is built from two parts: one is the contribution of the single post, the other is the contribution of the whole post collection; the latter plays a smoothing role. Let β denote the weight of the single post's contribution. The following experiments were run on the value of β:
For the model in which an answer is generated from the question it answers:
[Table shown as image in original: evaluation metrics for the tested values of β]
It can be seen that β performs best at 0.5; the change in MRR is especially evident, with an improvement of more than 22%. On the other metrics, the differences caused by the three values are not obvious.
For the model built from the user's history of posts:
[Table shown as image in original: evaluation metrics for the tested values of β]
In this model the behavior of β is quite unlike that in the previous model: the model performs best at β = 0.9, but overall the differences among the three values are not obvious. To balance the two models above, and further models, against the performance differences brought by the choice of β, the default value of β is set to the middle-of-the-road 0.7.
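The smoothed per-post query probability being tuned here, p(q|M_td) = ∏_{t∈q} (β·p(t|M_td) + (1−β)·p(t|M_c)), can be sketched as follows (an illustrative implementation, with the names and the zero-probability guard as assumptions; log space avoids underflow on longer queries):

```python
import math
from collections import Counter

def query_log_prob(query_terms, post_terms, all_terms, beta=0.7):
    """log p(q|M_td): each query word mixes the post's own word
    distribution (weight beta) with the whole-collection distribution
    (weight 1 - beta), which smooths away zero probabilities for
    words absent from the single post."""
    post = Counter(post_terms)
    coll = Counter(all_terms)
    logp = 0.0
    for t in query_terms:
        p_post = post[t] / len(post_terms)
        p_coll = coll[t] / len(all_terms)
        p = beta * p_post + (1 - beta) * p_coll
        if p == 0.0:  # query word unseen even in the whole collection
            return float("-inf")
        logp += math.log(p)
    return logp
```

Ranking users by this score over their posts reproduces the query-likelihood ordering the experiments evaluate.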
Choosing the co-occurrence weight
To adjust users' expert degrees through co-occurrence on the basis of the original user scores, a weight must be assigned that determines to what extent the users who answered the same questions influence a given user's expert degree. The parameter δ denotes the proportion of the user's own score in the new score; the extreme case δ = 0 means that the user's expert degree is determined entirely by the other users who answered the same questions. Both models were tested, with β = 0.7.
a) Model in which answers are generated from questions, plus rerank
[Table shown as image in original: rerank metrics for the tested values of δ]
b) Model built from the user's history of posts, plus rerank
[Table shown as image in original: rerank metrics for the tested values of δ]
Unlike the previous model's data, the highest MRR appears at δ = 0.4, and at δ = 0.4 three of the measured metrics reach their highest values. To reconcile the differences among models, the most stable value, δ = 0.8, was chosen as the default.
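The δ blending described above can be sketched as follows (illustrative; the exact aggregation of the co-answerers' scores, here a simple average, is an assumption not fixed by the text):

```python
def rerank_score(own_score, co_answerer_scores, delta=0.8):
    """Blend a user's own expert score with the scores of users who
    answered the same questions. delta is the weight of the user's own
    score; delta = 0 means the new score comes entirely from the
    co-answerers, delta = 1 leaves the original score unchanged."""
    if not co_answerer_scores:
        return own_score
    neighbor_avg = sum(co_answerer_scores) / len(co_answerer_scores)
    return delta * own_score + (1 - delta) * neighbor_avg
```

With the default δ = 0.8, a user's own score dominates and the co-answerers contribute a 20% correction.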
Performance comparison
Varying the contribution weights of posts
The two baselines are:
Baseline 1: ranking by the number of answer posts.
Baseline 2: computing with equal contribution weights for all posts.
The experimental methods compared are the following:
Method 1: the probability that an answer is generated by the question it answers.
Method 2: the probability that a post is generated by the user's historical post model.
Method 3: the similarity between a post and the best answer of the question it answers.
Method 4: the ratio of a post's length to the length of all the replies.
The experimental results are listed below:
[Table shown as image in original: metric comparison of the baselines and Methods 1 to 4]
It can be seen that Baseline 2 is the weakest: assigning equal weight to every post is unfavorable for capturing a user's expert degree. Because it does not emphasize any of the posts the user published, every post contributes identically, and the user's knowledge focus is not made prominent within the user's overall knowledge. Baseline 1 is slightly lower in the overall results, but its performance shows no huge difference from the other methods. The main reason is the data source: Baseline 1 ranks by post count, and the current forum data were all drawn from sub-forums under the same category (Computer & Internet). Many topics concerning computers and networking are closely related, and similar knowledge tends to be mastered together, so the people who answer the most questions under this forum really do tend to be judged experts. If the data source were changed, for example combining sub-forums of the Computer and History forums and again ranking expert degree by post count, the performance would inevitably be very low, because those who post the most under History and those who post the most under Computer do not understand each other's knowledge domains well.
Baseline 1 is the only ranking not based on a language model. From the comparison of Baseline 2 with Method 1 and Method 2 it can be seen that varying the weights among posts is very important: P10 is 0.6167 for Method 1 and Method 2, while for Baseline 2 it is only 0.3333. This shows that judging post quality from content is feasible. Whether judged against the question or against the user's history, both still use a language model to produce probabilities: each of the user's posts is treated as one specific knowledge domain of the user, and the user's overall expertise scope is determined according to the different weightings.
Method 3's outstanding showing is its MRR of nearly 80%, which means that the first correct expert can be found at the first or second position along the returned ranking. Method 3 performs so well on MRR because comparing similarity with the best answer is currently a superior way of judging answer quality. The best answer is selected manually by the asker, a step that is an advantage none of the other methods have. The other methods judge a post's quality computationally from its various attributes, and the attributes of high-quality posts may well not fit the modeling assumptions, for example length or keyword frequency. Therefore, a direct comparison with the best answer, being close to a human's manual judgment of the post's quality, is the most effective and most reasonable way to judge post quality.
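Method 3's best-answer comparison is, per claim 2, a cosine similarity between term vectors; a minimal sketch under the assumption of raw term-frequency vectors (the weighting scheme is not specified in the text):

```python
import math
from collections import Counter

def cosine_sim(post_terms, best_answer_terms):
    """Cosine similarity between the term-frequency vectors of a post
    and the question's best answer; used as the post's quality weight."""
    a, b = Counter(post_terms), Counter(best_answer_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A larger cosine means a smaller angle between the two vectors and hence a more similar, presumably higher-quality, answer.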
Method 4 is the one among the three modeling methods involving no manual judgment whose MRR and P5 are higher, but its other metrics are all unremarkable. This method hopes to compensate for the reduction in probability that long posts suffer. Its performance can be understood as follows: the method does increase the relevance of long posts, but it also weakens the probability of short posts quite severely, so it finds experts well in the top few positions, improving MRR and P5; however, adding too much weight for long posts pushes back the ranks of experts who have only short posts, so the other overall metrics are not prominent.
Performance comparison of length normalization
Because Method 1 and Method 2 employ a language model when computing post weights, I also tried adding Method 4's weighting into these two methods, to see whether this could better adapt to the language model's sensitivity to length changes.
Method 4 hopes to compensate, via length, for the negative effect long posts suffer in probability. Run in isolation, Method 4 performs acceptably: not outstanding, but not very poor either. Applied on top of Method 1 and Method 2, however, the effect is bad; most metrics decline.
The likely reason is that length compensation applied directly as a weight on the language-model posts can adequately compensate for the too-small probabilities of long posts, but once a further language model is layered on top, applying length compensation directly again muddles things. Because the language model used for weighting processes the posts again, post length is no longer a simple linear factor; it becomes complicated, and can no longer be compensated by a crude length normalization that is directly proportional to length.
Performance comparison of reranking
[Tables shown as images in original: metric comparison before and after reranking]
A weighting adjustment was made according to PageRank values computed over the co-occurrence graph between users. After the adjustment, it can be seen that some methods do achieve improved performance, especially on MRR: Method 1's MRR rises from 0.58 to 0.68 after reranking, and Method 2's rises from 0.58 to 0.77. The weights of these two methods are produced by probabilistic modeling, in which whether any single word occurs, and its frequency, can strongly influence the probability; after incorporating the co-occurrence graph, the answers of other users serve as a stable base, reducing the fluctuation caused by words that appear by chance.
The optimization has little effect on Method 3 and Method 4. The explanation is that Method 3 adjusts post contributions by similarity to the best answer, its estimate is already accurate, and the corrective effect of the co-occurrence graph is small. In Method 4, the weight adjustment for long posts is excessive, so experts with long posts and those with short posts are severely polarized in the ranking, users with long posts near the front and those without them near the back; after incorporating the co-occurrence-graph weights, the rankings tend to converge toward the middle. Therefore P5 drops while P10 and P15 rise slightly, showing that experts tend to cluster in the middle part of the list.
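The PageRank computation behind the rerank can be sketched in plain form (illustrative only; the patent's additional weighting by shared-question counts, claim 7, is omitted here):

```python
def pagerank(adj, d=0.85, iters=50):
    """Plain PageRank over a user co-occurrence graph given as
    {user: [users who answered the same questions]}.
    Returns a dict mapping each user to a score summing to 1."""
    nodes = list(adj)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - d) / n for u in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                # dangling node: spread its mass evenly over all nodes
                for v in nodes:
                    nxt[v] += d * pr[u] / n
            else:
                share = d * pr[u] / len(out)
                for v in out:
                    nxt[v] += share
        pr = nxt
    return pr
```

The resulting scores are then blended with each user's original expert degree, stabilizing rankings driven by chance word occurrences.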
Performance differences between queries
The evaluations above all discuss the average level of the 6 queries. The differences among individual queries are considerable; the discussion below uses Method 2 + Rerank, which has the best overall performance.
Fig. 2 shows that query3 is very low on all metrics among these 6 queries; its MRR only reaches 0.1. query1, query2 and query6 basically achieve performance of 70% and above, while all of query3's results hover between 0.1 and 0.2 throughout. query4 and query5 are at a middle level of performance.
Differences among the models have little effect on the relative standing of these 6 queries. For Method 3, which assigns weights by similarity to the best answer without reranking, the metrics of the 6 queries are as shown in Fig. 3:
Similar to Fig. 2, it can be seen that query3 is consistently the worst of the 6, while query1, 2 and 6 perform better. Evidently the difference among models is not the main cause of this result; the greatest difference between the queries is their content.
The contents of the 6 queries are as follows:
[Table shown as image in original: the text of the 6 queries]
There are two main reasons why query1, 2 and 6 perform well:
The content matches the words appearing in the question. Take query2's "buy" and "macbook": although query2 is very short, everything being asked appears in the query, and these few words express clearly and specifically what is wanted.
The keywords that appear coincide frequently with words in the overall post collection. For example, query6's "surf" and "internet" are words that indicate the direction of the question very clearly; because they occur with high frequency in the overall post collection, they raise the matching users' expert degrees.
The poorly performing queries show the following problems:
The question lacks specificity. In query4, for example, the content is insufficient and contains no special keywords that could express the main need. "slow" and "speed" easily produce ambiguity across the whole Computer & Internet forum: judging from the keywords alone, the question might be asking about computer software and hardware under the Computer subcategory, or about network congestion under the Internet subcategory. It is therefore difficult to give accurate user expert degrees. As another example, the word "clean" in query3 also appears in query5, yet one refers to cleaning the machine's appearance and the other to cleaning out viruses and junk files.
Even words that express the question's meaning accurately may be low-frequency words in the whole document collection.
query3's content is among the longer of the 6 queries, and its main keywords, "dirty" and "clean", are specific. The reason its performance is low under every model may be that the word "dirty" occurs very rarely: the whole corpus contains 894,808 words in total, each word occurring 50 times on average, while "dirty" occurs only 15 times. See the following table for details:
[Table shown as image in original: frequency statistics of the query keywords]
The above are only preferred embodiments of the present invention. It should be pointed out that those skilled in the art may make improvements and substitutions without departing from the technical principle of the present invention, and such improvements and substitutions should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. An expert recommendation method based on a language model, characterized by comprising the steps of:
S1: collecting users' published content to characterize the knowledge features of the corresponding users, using a language model from text retrieval to perform probability-based text modeling of the users' knowledge features, the built model containing an index of each user's expert-degree data;
each post published by a user is modeled as a dictionary sample from which the post itself is generated, so that the probability that the post generates a query can be computed;
the language model is built in two respects: first, a separate language model M_td is built for each post of the user; second, all posts of all users are regarded as one text and a language model M_c is built over it; the probability of occurrence of the whole query is
p(q|M_td) = ∏_{t∈q} (β · p(t|M_td) + (1 − β) · p(t|M_c));
wherein q is the query text; β is a parameter used to balance the weights of the whole post collection and the single post; t ∈ q ranges over the words in the query text; td is a single post published by the user; c is all the posts of all users, i.e., the whole post collection; p(t|M_td) is the probability that the user's post generates the query word t; p(t|M_c) is the probability that all posts of all users generate the query word t; and p(q|M_td) is the probability that the user's post generates the query text q;
S2: building a user relationship model between users, the user relationship model containing a user relationship graph through which users' expert degrees can influence one another;
the user relationship model is built on Google's PageRank algorithm;
S3: when a query is given, computing original expert degrees according to each user's expert-degree index information and producing an original ranked user list;
S4: adjusting each user's expert degree according to the connections in the user relationship graph to obtain the final ranked user list.
2. The method of claim 1, characterized in that a vector model is used to compute the similarity between a post and the best answer; high-dimensional vector spaces are built respectively for the post whose quality is to be assessed and for the best answer, and the angle between the two vectors is computed; the larger the cosine of the angle, the smaller the angle between the two vectors and the more similar the two are considered; the vector similarity of the post and the best answer is
Sim(td, BestAns) = cos(v_td, v_BestAns) = (v_td · v_BestAns) / (|v_td| · |v_BestAns|);
wherein BestAns is the best answer, v_td and v_BestAns are the vectors of the post and the best answer, and Sim(td, BestAns) is the vector similarity between the single post published by the user and the best answer.
3. The method of claim 1, characterized in that the collection of all of a user's answer posts is regarded as one large text library; a language model is built over this text library, the probability that this language model generates the current answer is computed, and the quality of the answer is judged according to whether the answer content derives from the user's knowledge feature architecture.
4. The method of claim 1, characterized in that the formula used for length correction is
[Length-correction formula shown as image in original]
5. The method of claim 1, characterized in that in step S1 the users' knowledge features are stored in text inverted-index form, supporting continual updating and subsequent cleanup operations.
6. The method of claim 1, characterized in that the parameter estimation of the model built in step S1 can use point estimation, moment estimation, or maximum-likelihood estimation.
7. The method of claim 1, characterized in that after the users' expert-degree ranking is computed from the built language model, it is corrected by the users' PageRank values; before a user's PageRank value is used, it is multiplied, as a weighting, by the number of questions shared with each co-answering user.
CN 201110373475 2011-11-22 2011-11-22 Expert recommendation method based on language model Expired - Fee Related CN102495860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201110373475 CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201110373475 CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Publications (2)

Publication Number Publication Date
CN102495860A CN102495860A (en) 2012-06-13
CN102495860B true CN102495860B (en) 2013-10-02

Family

ID=46187685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201110373475 Expired - Fee Related CN102495860B (en) 2011-11-22 2011-11-22 Expert recommendation method based on language model

Country Status (1)

Country Link
CN (1) CN102495860B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10984387B2 (en) 2011-06-28 2021-04-20 Microsoft Technology Licensing, Llc Automatic task extraction and calendar entry
CN102819605A (en) * 2012-08-17 2012-12-12 东方钢铁电子商务有限公司 Adaptability matching method
CN102880640B (en) * 2012-08-20 2015-04-01 浙江大学 Network modeling-based service recommending method
CN103810169B (en) * 2012-11-06 2018-01-09 腾讯科技(深圳)有限公司 A kind of method and apparatus for excavating community domain expert
CN103631859B (en) * 2013-10-24 2017-01-11 杭州电子科技大学 Intelligent review expert recommending method for science and technology projects
CN105335447A (en) * 2014-08-14 2016-02-17 北京奇虎科技有限公司 Computer network-based expert question-answering system and construction method thereof
CN104636456B (en) * 2015-02-03 2018-01-23 大连理工大学 The problem of one kind is based on term vector method for routing
US10361981B2 (en) 2015-05-15 2019-07-23 Microsoft Technology Licensing, Llc Automatic extraction of commitments and requests from communications and content
US10795878B2 (en) 2015-10-23 2020-10-06 International Business Machines Corporation System and method for identifying answer key problems in a natural language question and answering system
CN107291815A (en) * 2017-05-22 2017-10-24 四川大学 Recommend method in Ask-Answer Community based on cross-platform tag fusion
CN110020096B (en) * 2017-07-24 2021-09-07 北京国双科技有限公司 Query-based classifier training method and device
US11144602B2 (en) 2017-08-31 2021-10-12 International Business Machines Corporation Exploiting answer key modification history for training a question and answering system
CN107766545A (en) * 2017-10-31 2018-03-06 浪潮软件集团有限公司 Scientific and technological data management method and device
CN108153816A (en) * 2017-11-29 2018-06-12 浙江大学 A kind of method for learning to solve community's question-answering task using asymmetrical multi-panel sorting network
CN108287875B (en) * 2017-12-29 2021-10-26 东软集团股份有限公司 Character co-occurrence relation determining method, expert recommending method, device and equipment
CN109635183B (en) * 2018-11-01 2021-09-21 南京航空航天大学 Community-based partner recommendation method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101321190A (en) * 2008-07-04 2008-12-10 清华大学 Recommend method and recommend system of heterogeneous network
CN101751454A (en) * 2009-12-12 2010-06-23 浙江大学 Selection method of network answers based on probabilistic latent semantic analysis
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101321190A (en) * 2008-07-04 2008-12-10 清华大学 Recommend method and recommend system of heterogeneous network
CN101751454A (en) * 2009-12-12 2010-06-23 浙江大学 Selection method of network answers based on probabilistic latent semantic analysis
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《A language modeling framework for expert finding》; Krisztian Balog et al.; 《Information Processing and Management》; 20091231; pp. 1-19 *
Krisztian Balog et al. 《A language modeling framework for expert finding》. 《Information Processing and Management》. 2009, pp. 1-19.

Also Published As

Publication number Publication date
CN102495860A (en) 2012-06-13

Similar Documents

Publication Publication Date Title
CN102495860B (en) Expert recommendation method based on language model
US9324112B2 (en) Ranking authors in social media systems
CN103678564B (en) Internet product research system based on data mining
Elmeleegy et al. Mashup advisor: A recommendation tool for mashup development
CN106504011A (en) A kind of methods of exhibiting of business object and device
US20090265290A1 (en) Optimizing ranking functions using click data
CN103020851B (en) A kind of metric calculation method supporting comment on commodity data multidimensional to analyze
US20110196860A1 (en) Method and apparatus for rating user generated content in search results
CN103064945A (en) Situation searching method based on body
CN104834686A (en) Video recommendation method based on hybrid semantic matrix
Gkotsis et al. It's all in the content: state of the art best answer prediction based on discretisation of shallow linguistic features
CN106294863A (en) A kind of abstract method for mass text fast understanding
Sung et al. Booming up the long tails: Discovering potentially contributive users in community-based question answering services
Liang et al. Personalized recommender system based on item taxonomy and folksonomy
Brocken et al. Bing-CF-IDF+: A semantics-driven news recommender system
Malo et al. Concept‐based document classification using Wikipedia and value function
Ravanifard et al. Content-aware listwise collaborative filtering
Lops et al. Improving social filtering techniques through wordnet-based user profiles
Lops et al. A semantic content-based recommender system integrating folksonomies for personalized access
Shen et al. A new user similarity measure for collaborative filtering algorithm
CN107016135A (en) It is a kind of towards non-determined, infidelity, onlap the positive and negative two-way dynamic equilibrium search strategy of miscellaneous resource environment
Jian et al. Multi-task gnn for substitute identification
Batra et al. Content based hidden web ranking algorithm (CHWRA)
Tang et al. Domain problem‐solving expert identification in community question answering
CN102033961A (en) Open-type knowledge sharing platform and polysemous word showing method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20131002

Termination date: 20191122