CN101621391A - Method and system for classifying short texts based on probability topic - Google Patents

Method and system for classifying short texts based on probability topic

Info

Publication number
CN101621391A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910090377A
Other languages
Chinese (zh)
Inventor
刘文印
权小军
张加龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BAIWENBAIDA NETWORK TECHNOLOGIES Co Ltd
Original Assignee
BEIJING BAIWENBAIDA NETWORK TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BAIWENBAIDA NETWORK TECHNOLOGIES Co Ltd
Priority to CN200910090377A
Publication of CN101621391A
Legal status: Pending (current)

Classifications

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for classifying short texts based on probabilistic topics. The method is used in the data processing device of a question-answering system to classify short texts according to their similarity. The method comprises the following steps: obtaining initialized text vectors for an input target short text and for a short text taken from the database of the question-answering system; scanning the two short texts to obtain the distinguishing words of each; when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, modifying the text vectors of the two short texts according to those degrees of correlation; computing the similarity of the two short texts from the modified text vectors; obtaining another short text from the database of the question-answering system and repeating from the scanning step until all short texts in the database have been traversed; and classifying the target short text according to the similarities.

Description

Method and system for classifying short texts based on probabilistic topics
Technical field
The present invention relates to the field of information technology in which data mining is applied to text processing, and in particular to a method and system for classifying short texts based on probabilistic topics, as well as to a short-text retrieval method and a spam-recognition method.
Background art
In today's era of rapidly developing information technology, users can obtain large amounts of information through many channels: browsing web pages, retrieving with a search engine, receiving short messages and e-mail, using a network question-answering system, and so on. A problem that frequently arises, however, is that the data are massive while useful information is scarce.
For example, a mailbox may contain a large number of messages, including both normal work or personal mail and spam. Received short messages include a large number of useless advertisements. In a network question-answering system, a user who poses a question may receive a large number of answers from other users, some of which stray from the topic while others are very accurate. Retrieving a keyword on the web returns a large number of links containing that keyword; some of the linked pages genuinely contain content relevant to the keyword, while others are poorly correlated with it. The many posts in a forum are arranged freely in chronological order, each independent of the others, with no organization by content.
Because the effort involved is enormous, users are often unwilling or unable to traverse every data item. How to obtain valid or potentially useful data from massive data, or to exclude useless data, is therefore a problem demanding a prompt solution.
At present, massive data frequently appear in the form of short texts, for example the short messages, mail titles, web search results, and question-answering answers mentioned above. If short messages, mail titles, web links, and answers in a question-answering system could be classified automatically according to their inner links (for example, semantic relations), so that users need only read the categories they are interested in, or so that junk information could be recognized and shielded, the time users spend locating valid data would be significantly reduced. In general, narrowing the scope that users must check one item at a time markedly improves convenience and operating efficiency, and improves the user experience.
To achieve this goal, short texts must first be data-mined, and in particular divided into categories according to their inherent semantic relevance.
In the prior art there exist methods that perform class discrimination by computing the similarity between texts. A text is represented by its term-frequency vector, and text similarity is obtained by computing the distance between the vectors. However, most existing text-similarity methods are suitable only for long texts. Traditional long-text similarity methods are effective because similar long texts usually share a number of identical words. Similar short texts, by contrast, need not share any word: the flexibility of natural language lets people express the same meaning with different words, so existing similarity methods perform poorly on short texts.
Some solutions to this problem have been proposed in the prior art; their general idea is to expand the short texts by means of a dictionary (such as WordNet). In general, such a method can find the correlation between words fairly accurately, but in many cases the true relation between two words depends on the concrete application background, that is, on the concrete semantic environment.
How to exploit the characteristic that short texts share few identical words, divide them into different categories according to their semantic connections, and extract or exclude particular categories, so as to locate valid data at high speed in information services such as short-message service, search service, automatic question answering, and mail service, is a problem the industry needs to solve.
Summary of the invention
The problem solved by the present invention is to locate valid data at high speed in information services such as short-message service, search service, automatic question answering, and mail service, by classifying short texts according to their inherent semantic relations and then efficiently extracting target data on the basis of that classification.
The invention discloses a method for classifying short texts based on probabilistic topics, applied in the data processing device of a question-answering system and used for classifying short texts (which may be questions or answers) according to the similarity between them, comprising:
obtaining initialized text vectors for an input target short text and for a short text taken from the database of the question-answering system;
scanning the two short texts and obtaining the distinguishing words of each;
when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, correcting the text vectors of the two short texts according to those degrees of correlation;
computing the similarity between the two short texts from their corrected text vectors;
obtaining another short text from the database of the question-answering system and repeating from the scanning step until all short texts in the database have been traversed; and
classifying the target short text according to the similarities.
The invention also discloses a system for classifying short texts based on probabilistic topics, applied in the data processing device of a question-answering system and used for classifying short texts (which may be questions or answers) according to the similarity between them, comprising:
an initialization module, which obtains initialized text vectors for an input target short text and for a short text taken from the database of the question-answering system;
a distinguishing-word identification module, which scans the two short texts and obtains the distinguishing words of each;
a judgment and correction module, which, when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, corrects the text vectors of the two short texts according to those degrees of correlation;
a computation module, which computes the similarity between the two short texts from their corrected text vectors; and
a classification module, which classifies the target short text according to the similarities.
The invention also discloses a method for retrieving short texts based on probabilistic topics, applied in a search engine server and comprising the steps of:
step 1: receiving a retrieval short text, transmitted from a search user terminal, as the search term;
step 2: obtaining a short text from the database of the search engine server;
step 3: obtaining initialized text vectors for the retrieval short text and for the short text obtained from the database;
step 4: scanning the two short texts and obtaining the distinguishing words of each;
step 5: when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, correcting the text vectors of the two short texts according to those degrees of correlation;
step 6: computing the similarity between the two short texts from their corrected text vectors;
step 7: repeating from step 2 until the short texts in the database have been traversed; and
step 8: extracting the short texts whose similarity to the retrieval short text is greater than a predetermined value and sending them to the search user terminal for display.
The invention also discloses a method for recognizing spam based on probabilistic topics, applied in a mail server and comprising the steps of:
step 1: storing model spam short texts in the database of the mail server and obtaining one of them;
step 2: obtaining a short text from the database of the mail server;
step 3: obtaining initialized text vectors for the two short texts;
step 4: scanning the two short texts and obtaining the distinguishing words of each;
step 5: when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, correcting the text vectors of the two short texts according to those degrees of correlation;
step 6: computing the similarity between the two short texts from their corrected text vectors;
step 7: repeating from step 2 until all short texts in the database have been traversed, and once they have been, repeating from step 1 until all model spam short texts have been traversed; and
step 8: marking as spam every short text obtained from the database of the mail server whose similarity to a model spam short text is greater than a predetermined value.
The invention also discloses a method for shielding junk and advertising information in a mobile terminal based on probabilistic topics, applied in the firewall of a mobile communication system, where model short texts are stored in the firewall and the mobile communication system comprises a short-message server, the firewall, and user terminals, comprising the steps of:
step 1: the firewall receives a short-message text sent by the short-message server;
step 2: the firewall obtains a model short text;
step 3: obtaining initialized text vectors for the short-message text and the model short text;
step 4: scanning the two short texts and obtaining the distinguishing words of each;
step 5: when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, correcting the text vectors of the two short texts according to those degrees of correlation;
step 6: computing the similarity between the two short texts from their corrected text vectors;
step 7: repeating from step 2 until all model short texts in the firewall have been traversed; and
step 8: the firewall intercepts the short-message text if its similarity to a model short text is higher than a predetermined value.
The effect achieved by the present invention is that, in a question-answering system, a question posed by a user is automatically located according to its inherent meaning and accurately assigned to a category page, so that each category page corresponds closely in logic to the questions it contains. Answering users can then choose a page matching the fields they are good at, so that questions receive fast, authoritative responses. At the same time, the inventive method improves the accuracy of short-text similarity computation and can recommend questions more accurately to experts familiar with the question's domain, making the question-answering system, as a platform for information exchange, distribute information more precisely.
In the field of short-text retrieval, the retrieval method of the present invention searches on the basis of probabilistic topics and judges relevance of content by whether the similarity reaches a threshold, and therefore also greatly improves the accuracy of the results returned by a short-text search.
In the fields of excluding spam and shielding advertising information, the method can effectively recognize spam and advertising information and mark or shield it, improving the efficiency and accuracy of spam and advertisement recognition.
Description of the drawings
Figures 1A, 1B, 1C, and 1D are structural schematic diagrams of a system for classifying short texts based on probabilistic topics;
Figures 2A and 2B are flow charts of classifying short texts based on probabilistic topics;
Figure 3 is a flow chart of recommending questions based on probabilistic topics;
Figure 4A is a structural schematic diagram of a search engine server;
Figure 4B is a flow chart of retrieving short texts based on probabilistic topics;
Figure 5A is a structural schematic diagram of a mail server;
Figure 5B is a flow chart of recognizing spam based on probabilistic topics;
Figure 6A is a structural schematic diagram of an application embodiment of the present invention;
Figure 6B is a schematic flow chart of an application embodiment of the present invention;
Figures 7A and 7B are diagrams of the micro-averaged and macro-averaged F1 values obtained with different numbers of topics on two data sets;
Figures 8A and 8B are diagrams of the micro-averaged and macro-averaged F1 values obtained with different values of λ on two data sets.
Embodiments
The invention discloses a method and system for classifying short texts based on probabilistic topics, which can discover the true relations between words according to the probabilistic topics and then compute the similarity between short texts, so as to locate target data efficiently.
A system for classifying short texts based on probabilistic topics is arranged in the data processing device of a mail server, of a search engine server, of the firewall of a mobile communication system, or of the server side of a question-answering system.
Take the question-answering system as an example. A question-answering system is an online interactive system, that is, a computer processing system that realizes interactive questions and answers among users; see also the Chinese patent with application number 200510130778.5.
The technical problem solved by the present invention in a question-answering system is that a question posed by a user is automatically located according to its inherent meaning and accurately assigned to a category page for display, so that each category page corresponds closely in logic to the questions it contains; answering users can then choose a page matching the fields they are good at, so that questions receive fast, authoritative responses.
Please refer to Figures 1A and 1B, which are structural schematic diagrams of a system of the present invention for classifying short texts based on probabilistic topics.
The system 100 for classifying short texts based on probabilistic topics comprises a question-answering-system server side 10 and a plurality of terminals 20. The question-answering-system server side 10 comprises a communication connection interface 11, a database 12, a data processing device 13, a display module 14, and an input module 15.
The communication connection interface 11 is used to receive data from the terminals 20, for example target short texts such as questions or answers sent by users through the terminals 20. The data processing device 13 analyzes and processes, one after another, the target short text received through the communication connection interface 11 and the short texts extracted from the database 12. The input module 15 is used to input parameters or supervisory instructions. The display module 14 displays the analyzed and processed short texts.
A plurality of modules are further arranged in the data processing device 13, comprising:
an initialization module 1301, which obtains initialized text vectors for a target short text received from the communication connection interface 11 and for a short text obtained from the database 12;
a distinguishing-word identification module 1302, which scans the two short texts and obtains the distinguishing words of each; a distinguishing word is a word that occurs in one short text but not in the other, and the distinguishing words can usually hint at the relation between two short texts;
a judgment and correction module 1303, which, when the degrees of correlation between the distinguishing words of the two short texts and a probabilistic topic are both higher than a threshold, corrects the text vectors of the two short texts according to those degrees of correlation;
a probabilistic-topic extraction module 1304, which generates the probabilistic topics by sampling;
a computation module 1305, which computes the similarity between the two short texts from their corrected text vectors; and
a classification module 1306, which classifies the target short text according to the similarities.
The classification module 1306 can perform the concrete classification by means of a K-nearest-neighbour classifier. The K-nearest-neighbour classifier is a known technique; its basic principle is to compute the similarity between the target short text and all short texts already present in the categories, select the K texts with the highest similarity, and then decide, according to the categories of these K short texts, which category the target short text is finally assigned to.
The concrete implementation is to compute the correlation between the target short text and each of the categories in which the K short texts lie, and finally assign the target short text to the category with the maximum correlation. The correlation between a category and the target short text is computed as follows. For example, set K = 5, i.e., choose the 5 short texts with the highest similarity to the target short text, and suppose short texts d1 and d2 belong to category C1 while short texts d3, d4, and d5 belong to category C2. Let S1 denote the similarity between the target short text and d1, S2 the similarity between the target short text and d2, and so on. If S1 + S2 > S3 + S4 + S5, the target short text is assigned to category C1; otherwise it is assigned to C2. Other values of K may also be chosen.
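This class-scoring rule can be sketched in Python as follows (a minimal illustration of the worked example above; the similarity values are made up, not from the patent):

```python
def knn_assign(neighbours):
    """neighbours: list of (similarity, category) pairs for the K most similar
    short texts. Returns the category whose summed similarity is largest."""
    scores = {}
    for sim, cat in neighbours:
        scores[cat] = scores.get(cat, 0.0) + sim
    return max(scores, key=scores.get)

# Worked example from the text: d1, d2 in C1; d3, d4, d5 in C2.
neighbours = [(0.9, "C1"), (0.8, "C1"), (0.5, "C2"), (0.4, "C2"), (0.3, "C2")]
print(knn_assign(neighbours))  # S1+S2 = 1.7 > S3+S4+S5 = 1.2, so prints C1
```

The tie-free case shown here assigns the target to C1 because the two C1 neighbours together outweigh the three C2 neighbours.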
Besides the K-nearest-neighbour classifier, other methods in the prior art, such as the centroid method, can also be adopted.
The principle of the centroid method is to express each category as a vector, compute the similarity between the target short text and each category vector, and take the category with the maximum similarity as the category of the target short text. Each category is expressed as a vector by taking the weighted average of the vectors of all short texts in the category; the resulting average vector serves as the category vector.
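The centroid method can be sketched as follows (an unweighted average is used for simplicity; all names are illustrative, not the patent's):

```python
import math

def centroid_classify(target, class_vectors):
    """class_vectors: category -> list of member text vectors.
    Each category is reduced to the mean of its members, and the target
    is assigned to the category whose centroid has maximum cosine similarity."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def mean(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)]

    centroids = {c: mean(vs) for c, vs in class_vectors.items()}
    return max(centroids, key=lambda c: cosine(target, centroids[c]))

classes = {"C1": [[1.0, 0.0], [0.8, 0.2]], "C2": [[0.0, 1.0]]}
print(centroid_classify([0.9, 0.1], classes))  # prints C1
```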
The distinguishing-word identification module 1302 further comprises a comparison module 1 and a distinguishing-word-set acquisition module 2. The comparison module 1 compares the two short texts; the distinguishing-word-set acquisition module 2 obtains the distinguishing-word set of each short text.
The judgment and correction module 1303 further comprises:
a topic selection module 3, which chooses a probabilistic topic t_i and searches the distinguishing-word set of each of the two short texts for the word with the maximum probability on topic t_i;
a judgment module 4, which judges whether the probabilities of the two selected distinguishing words on topic t_i are both greater than the threshold; if both are, the correction module 5 is executed, otherwise the loop module 6 is executed;
a correction module 5, which corrects the initialized text vectors; and
a loop module 6, which judges whether any topic remains unselected; if so, the topic selection module 3 is executed, and if not, the computation module 1305 is executed.
The technical scheme of the present invention is introduced in detail below by way of the following embodiments, with reference to the above system for classifying short texts based on probabilistic topics and in combination with the concrete flow. Please refer to Figure 2A, a flow chart of classifying short texts based on probabilistic topics.
The database 12 stores a large amount of short-text data comprising questions and answers, which constitute a text set D. The text set D consists of M short texts and contains N words that differ from one another; that is, D builds up a dictionary W = {w_1, w_2, ..., w_N}. The text set D also contains Z topics, T = {t_1, t_2, ..., t_Z}. d_i denotes one text in D.
A terminal 20 sends a target short text d_1 to the question-answering-system server side 10.
Step 201: the data processing device 13 scans the content of the database 12, uses the initialization module 1301 to obtain, randomly or in order, a not-yet-traversed short text d_2, and initializes a text vector for each of the two short texts d_1 and d_2.
The initialized text vectors V(1) and V(2) are
V(1) = {v_1(1), v_2(1), ..., v_N(1)}
V(2) = {v_1(2), v_2(2), ..., v_N(2)},
where v_i(j) is the weight of the dictionary word w_i in short text d_j (if w_i does not occur in the short text, v_i(j) is zero). v_i(j) is computed with the TF-IDF method, which is prior art. This initialization method is only one embodiment; initializing with other similar methods is also included in the scope of the present invention.
In the TF-IDF method, v_i(j) can be computed by the following formula:
v_i(j) = tf_ij × log(M / df_i),
where M is the number of short texts in the text set D, df_i is the number of short texts in D that contain the word w_i, and tf_ij is the number of times w_i occurs in short text d_j.
Other methods include the TF method, which computes only tf_ij, and the simple 0-1 assignment method: if the i-th word occurs in the short text, then v_i(j) is 1, otherwise it is 0.
Because the number of words a short text contains is limited, some noise words (for example, words that occur in many texts) can influence the short-text similarity computation, so TF-IDF can be used to weaken the influence of such words.
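The TF-IDF initialization of step 201 can be sketched as follows (a minimal sketch over a tokenized text set; function and variable names are illustrative, not the patent's):

```python
import math

def tfidf_vectors(texts):
    """texts: list of tokenized short texts. Returns (vocabulary, vectors),
    where vectors[j][i] = tf_ij * log(M / df_i) as in the formula above.
    A word that does not occur in a text gets weight zero (tf_ij = 0)."""
    M = len(texts)
    vocab = sorted({w for t in texts for w in t})
    df = {w: sum(1 for t in texts if w in t) for w in vocab}
    vectors = []
    for t in texts:
        vectors.append([t.count(w) * math.log(M / df[w]) for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([["apple", "price"], ["banana", "price"]])
# "price" occurs in both texts, so its weight is 1 * log(2/2) = 0 -
# exactly the weakening of common noise words described above.
```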
Step 202: the data processing device 13 uses the distinguishing-word identification module 1302 to obtain the distinguishing-word sets of d_1 and d_2.
That is, the two short texts d_1 and d_2 are compared, and the distinguishing-word set of each short text is obtained. A distinguishing word is a word that occurs in one short text but not in the other; the distinguishing words can usually hint at the relation between two short texts.
For example, consider the two short texts "the price of apples is very high" and "the price of bananas is very low". After the stop words are removed, the traditional method of computing short-text similarity can find only one co-occurring word, "price". But if a third-party medium can be found whose topic is "fruit", then by comparing "apple" and "banana" against that topic one can discover that a certain relation exists between apples and bananas, and in turn that an association exists between the two short texts.
Concretely, in this step the comparison module 1 in the distinguishing-word identification module 1302 compares the two short texts d_1 and d_2, and the distinguishing-word-set acquisition module 2 obtains the distinguishing-word set of each short text.
The distinguishing-word sets obtained are:
Dist(d_1) = {w | w ∈ d_1, w ∉ d_2}
Dist(d_2) = {w | w ∈ d_2, w ∉ d_1}
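The two set differences above map directly onto Python set operations (a minimal sketch with the apple/banana example; names are illustrative):

```python
def distinguishing_words(d1, d2):
    """Dist(d1) = words in d1 but not in d2; Dist(d2) = words in d2 but not in d1."""
    s1, s2 = set(d1), set(d2)
    return s1 - s2, s2 - s1

d1 = ["apple", "price", "high"]
d2 = ["banana", "price", "low"]
dist1, dist2 = distinguishing_words(d1, d2)
# dist1 is {'apple', 'high'} and dist2 is {'banana', 'low'};
# the shared word "price" appears in neither set.
```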
Step 203: select a probabilistic topic and use the judgment and correction module 1303 to judge the degree of correlation between the distinguishing words of the two short texts and the probabilistic topic; if the correlations of the two distinguishing words with the same probabilistic topic are both higher than a threshold, correct the text vectors of the two short texts according to those degrees of correlation.
The probabilistic topic is extracted from the probabilistic-topic set T, which the probabilistic-topic extraction module 1304 obtains by the Gibbs sampling method. The step of extracting the probabilistic topics by Gibbs sampling can be executed before step 201. Gibbs sampling is common knowledge in the prior art; a related description is given further below. Of course, other sampling methods may also be adopted to obtain the probabilistic topics.
Step 203 further comprises the following implementation:
Step 2031: use the topic selection module 3 to choose a probabilistic topic t_i from T.
Step 2032: use the judgment module 4 to compute the probability of each distinguishing word in Dist(d_1) and Dist(d_2) on topic t_i, and select the distinguishing word w_m with the maximum probability on t_i in the distinguishing-word set of short text d_1, and the distinguishing word w_n with the maximum probability on t_i in the distinguishing-word set of short text d_2.
Step 2033: judge whether the probabilities of the two selected distinguishing words w_m and w_n on topic t_i are both greater than a preset threshold λ. If both are, the two words are correlated; execute step 2034. Otherwise they are uncorrelated; execute step 2035.
That is, this step judges whether p(w_m | t_i) ≥ λ and p(w_n | t_i) ≥ λ, where p(w_m | t_i) denotes the probability that the word w_m belongs to topic t_i.
The preset threshold λ is obtained through the input module 15.
Step 2034: use the correction module 5 to correct V(1) and V(2) according to the following formulas:
v_n(1) = v_n(1) + v_n(2) × P(w_n | t_i)
v_m(2) = v_m(2) + v_m(1) × P(w_m | t_i),
where v_n(1) denotes the n-th element of V(1), v_m(2) denotes the m-th element of V(2), and p(w_m | t_i) denotes the probability that the word w_m belongs to topic t_i.
That is, when the probabilities of the two selected distinguishing words w_m and w_n on topic t_i are both greater than the preset threshold λ, the two distinguishing words are deemed highly correlated with this topic, and the values of the corresponding components of the text vectors are increased to heighten the similarity of the two words, which helps improve the accuracy of the subsequent short-text similarity computation.
Other ways of heightening the similarity are also included within the scope of the invention, for example increasing the corresponding component by a fixed value such as 0.1, or by some other value.
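The correction of steps 2031-2034 for a single topic can be sketched as follows (the dict-based p(w|t) interface and all names are illustrative assumptions, not the patent's):

```python
def correct_vectors(v1, v2, dist1, dist2, word_index, p_w_given_t, lam):
    """For one topic t_i: pick the max-probability distinguishing word of each
    text (w_m from Dist(d1), w_n from Dist(d2)); if both probabilities reach
    the threshold lam, raise the corresponding (initially zero) components."""
    if not dist1 or not dist2:
        return v1, v2
    w_m = max(dist1, key=lambda w: p_w_given_t.get(w, 0.0))
    w_n = max(dist2, key=lambda w: p_w_given_t.get(w, 0.0))
    if p_w_given_t.get(w_m, 0.0) >= lam and p_w_given_t.get(w_n, 0.0) >= lam:
        m, n = word_index[w_m], word_index[w_n]
        v1[n] += v2[n] * p_w_given_t[w_n]  # v_n(1) += v_n(2) * P(w_n | t_i)
        v2[m] += v1[m] * p_w_given_t[w_m]  # v_m(2) += v_m(1) * P(w_m | t_i)
    return v1, v2

word_index = {"apple": 0, "banana": 1, "price": 2}
v1, v2 = correct_vectors([1.0, 0.0, 0.5], [0.0, 1.0, 0.5],
                         {"apple"}, {"banana"}, word_index,
                         {"apple": 0.4, "banana": 0.3}, lam=0.2)
# v1 gains a "banana" component (0.3) and v2 gains an "apple" component (0.4).
```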
Step 2035: use the loop module 6 to judge whether any topic remains unselected. If so, execute step 2031; if not, execute step 204.
Step 204: use the computation module 1305 to compute, from the corrected text vectors of the two short texts, the similarity of short texts d_1 and d_2 by the cosine method, that is,
Sim(d_1, d_2) = (V(1) · V(2)) / (|V(1)| |V(2)|),
where V(1) and V(2) are the corrected text vectors of d_1 and d_2, and Sim(d_1, d_2) is the similarity of d_1 and d_2.
This computation method is only an example; other prior-art methods of computing similarity can also realize the technical scheme of the present invention.
For example, the similarity can be computed as the dot product of the two vectors: Sim(d_1, d_2) = V(1) · V(2).
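Both similarity measures can be sketched in a few lines (a minimal illustration; a zero vector is given similarity 0 to avoid division by zero, which is an assumption, not specified by the patent):

```python
import math

def cosine_sim(v1, v2):
    """Sim(d1, d2) = (V(1) . V(2)) / (|V(1)| |V(2)|)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def dot_sim(v1, v2):
    """The dot-product alternative: Sim(d1, d2) = V(1) . V(2)."""
    return sum(a * b for a, b in zip(v1, v2))

print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # prints 1.0
```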
Step 205: if short texts in the database have not yet been traversed, execute step 201; after all short texts have been traversed, execute step 206.
Step 206: use the classification module 1306 to classify the target short text according to the similarities.
In one embodiment, a K-nearest-neighbour classifier can specifically be adopted to classify according to the similarity values. The K-nearest-neighbour classifier (KNN, http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm) belongs to the prior art, and its realization can also be seen in the description above.
The short texts that belong to one class are then output; that is, the short texts belonging to one class are presented on one page, or serve as the initial values of subsequent processing.
The above steps (201-205) can be iterated repeatedly to confirm the degrees of similarity between the target short text and all short texts in the database, so as to improve the accuracy of the classification.
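The full flow of steps 201-206 can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the topic distributions p(w|t) are assumed to be already available (e.g., from a previously run sampler), texts are pre-tokenized, and all names, defaults, and the simplified similarity-weighted vote are illustrative, not the patent's:

```python
import math

def classify_short_text(target, corpus, labels, topics, lam=0.2, K=3):
    """Steps 201-206: TF-IDF init, distinguishing words, topic-based vector
    correction, cosine similarity, and a K-nearest-neighbour vote.
    topics: topic id -> {word: p(w|t)} dict."""
    docs = [target] + corpus
    M = len(docs)
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    df = {w: sum(1 for d in docs if w in d) for w in vocab}

    def vec(d):  # step 201: TF-IDF initialization
        return [d.count(w) * math.log(M / df[w]) for w in vocab]

    def cosine(a, b):  # step 204
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    sims = []
    for d2 in corpus:  # step 205: traverse the database
        v1, v2 = vec(target), vec(d2)
        dist1 = set(target) - set(d2)  # step 202: distinguishing words
        dist2 = set(d2) - set(target)
        for p in topics.values():      # step 203: every topic
            if not dist1 or not dist2:
                break
            wm = max(dist1, key=lambda w: p.get(w, 0.0))
            wn = max(dist2, key=lambda w: p.get(w, 0.0))
            if p.get(wm, 0.0) >= lam and p.get(wn, 0.0) >= lam:
                v1[idx[wn]] += v2[idx[wn]] * p[wn]
                v2[idx[wm]] += v1[idx[wm]] * p[wm]
        sims.append(cosine(v1, v2))
    # step 206: similarity-weighted K-nearest-neighbour vote
    top = sorted(zip(sims, labels), reverse=True)[:K]
    scores = {}
    for s, c in top:
        scores[c] = scores.get(c, 0.0) + s
    return max(scores, key=scores.get)
```

With a "fruit" topic covering both "apple" and "banana", the target ["apple", "price"] is matched to ["banana", "price"] even though the texts share only the word "price".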
The extraction of probability topics is a known technique in this field and is only briefly explained here. In the prior art, the probabilistic topic model (Probabilistic Topic Model; see D. M. Blei, A. Y. Ng & M. I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research, 3, 993-1022, 2003) is used to represent short texts and topics, and the relation between the words in a text and the topics. The model is based on the assumption that a short text is composed of multiple topics, and that each topic is a probability distribution over words (i.e., the probability that a certain word belongs to a certain topic). The model can be used to generate a short text: a topic distribution is first selected, and then each word of the short text is generated by iterative sampling from this distribution, as shown in the following formula,
P(W(i)) = Σ_{j=1..Z} P(W(i) | tj) P(T(i) = tj),
where P(T(i) = tj) denotes the probability of topic tj in the generated document d, and P(W(i) | tj) denotes the probability that word W(i) belongs to topic tj.
The topics in the probabilistic topic model can be extracted by the Gibbs sampling method proposed by Griffiths and Steyvers (see T. Griffiths and M. Steyvers, Finding scientific topics, The National Academy of Sciences, 101:5228-5235, 2004). This method first assigns each term a random topic, and then iteratively samples each topic T(i) according to the following formula,
P(T(i) = tj | T(-i), W(i), D(i), ·) ∝ (C^{W(i)}_{T(i)} + β) / (Σ_{W(i)} C^{W(i)}_{T(i)} + N·β) · (C^{T(i)}_{D(i)} + α) / (Σ_{T(i)} C^{T(i)}_{D(i)} + Z·α)
where P(T(i) = tj | T(-i), W(i), D(i), ·) represents the estimated probability of assigning term i to topic tj; T(-i) represents the topic assignments of all other terms; "·" represents other information, such as W(-i), D(-i), and the hyperparameters α, β. α and β are two parameters that influence topic extraction.
C^{W(i)}_{T(i)} represents the number of times word W(i) has been sampled from topic T(i), not counting the current term i; C^{T(i)}_{D(i)} represents the number of times topic T(i) has been assigned to any term in text D(i), not counting the current term i. In the formula, the first fraction represents the probability that word W(i) belongs to topic tj; the second fraction represents the probability of topic T(i) in the topic distribution of text D(i).
By setting appropriate parameters α and β through the input module 15, a group of probability topics T = {t1, ..., ti, ..., tZ} can be obtained, each probability topic having the form ti = {ti1, ti2, ..., tiN}, where tij is a probability measuring how likely the j-th word is to occur in ti.
After topic extraction finishes, the probability that word wi belongs to topic tj can be expressed by the following formula:
P(wi | tj) = (C^{wi}_{tj} + β) / (Σ_{i=1..N} C^{wi}_{tj} + N·β).
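Under the assumptions stated above, the Gibbs sampling procedure and the final P(wi|tj) estimate can be sketched as follows. All names (`gibbs_lda`, `p_word_given_topic`) are illustrative; this is a minimal collapsed sampler, not the patented implementation, and the document-topic denominator (Σ_T C + Z·α) is dropped inside the sampling weights because it is constant over topics and cancels in the normalization.

```python
# Minimal collapsed Gibbs sampler for the topic model sketched above.
import random

def gibbs_lda(docs, Z, alpha, beta, iters=50, seed=0):
    """docs: list of token lists; Z: number of topics.
    Returns (topic-word counts, doc-topic counts, vocabulary list)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    widx = {w: i for i, w in enumerate(vocab)}
    N = len(vocab)
    ctw = [[0] * N for _ in range(Z)]          # C: word sampled from topic
    cdt = [[0] * Z for _ in range(len(docs))]  # C: topic assigned in doc
    assign = []
    for d, doc in enumerate(docs):             # random initial topics
        z_doc = []
        for w in doc:
            z = rng.randrange(Z)
            ctw[z][widx[w]] += 1
            cdt[d][z] += 1
            z_doc.append(z)
        assign.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]               # remove current term's counts
                ctw[z][widx[w]] -= 1
                cdt[d][z] -= 1
                # sampling weights: word-topic fraction times doc-topic
                # numerator (the doc-topic denominator is constant in t)
                weights = [
                    (ctw[t][widx[w]] + beta) / (sum(ctw[t]) + N * beta)
                    * (cdt[d][t] + alpha)
                    for t in range(Z)
                ]
                z = rng.choices(range(Z), weights=weights)[0]
                ctw[z][widx[w]] += 1
                cdt[d][z] += 1
                assign[d][i] = z
    return ctw, cdt, vocab

def p_word_given_topic(ctw, vocab, w, t, beta):
    """P(w_i | t_j) = (C + beta) / (sum_i C + N*beta), as in the formula above."""
    N = len(vocab)
    return (ctw[t][vocab.index(w)] + beta) / (sum(ctw[t]) + N * beta)
```

The experiments described later use 300 iterations with α = 50/Z and β = 0.01; the defaults here are only for illustration.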
In another embodiment, the present invention is also used in a question answering system to automatically send a posed question to an expert familiar with its problem domain, i.e., expert recommendation.
In a question answering system, some answering users may be set by the administrator as experts of certain fields. By calculating the similarity between a question short text and the questions an expert has answered, the system determines which expert the question should be recommended to. Because the method of the invention improves the accuracy of short-text similarity calculation, it can recommend a question more accurately to an expert familiar with its domain, so that the question answering system, as a platform of information exchange, distributes information more accurately.
Please refer to Fig. 3, a flow chart of question recommendation based on probability topics.
Step 301: a question short text d1 sent by a terminal 20 is transmitted through the communication connection interface 11 to the question answering system server 10; the data processing apparatus 13 scans the content of the database 12, and the initialization module 1301 obtains this short text d1 and one short text d2 from all the question short texts stored in the database 12, and initializes the text vectors of the two short texts d1, d2 respectively.
Their text vectors V(1), V(2) are initialized respectively as:
V(1) = {v1(1), v2(1), ..., vN(1)}
V(2) = {v1(2), v2(2), ..., vN(2)},
where vi(j) is the weight of word wi of the dictionary in short text dj; the dictionary comprises the words contained in all short texts, such as all questions and answers, in the system; vi(j) is calculated with the TF-IDF method.
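A minimal TF-IDF initialization of such a vector over a shared dictionary might look like the following; the helper name `tfidf_vector` and the plain tf × log(n/df) weighting are assumptions, since the patent only states that the TF-IDF method is used.

```python
# Build one text vector over the shared dictionary: term frequency in
# the document times inverse document frequency over the corpus.
import math

def tfidf_vector(doc_tokens, corpus, dictionary):
    """doc_tokens: token list of one short text; corpus: list of token
    lists (all short texts); dictionary: ordered word list."""
    n_docs = len(corpus)
    vec = []
    for w in dictionary:
        tf = doc_tokens.count(w) / len(doc_tokens) if doc_tokens else 0.0
        df = sum(1 for d in corpus if w in d)
        idf = math.log(n_docs / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Words that appear in every document get weight 0 under this idf variant; smoothed variants (e.g., log(1 + n/df)) avoid that if it is undesirable.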
Steps 302-305 correspond to steps 202-205 and are not detailed here.
Step 306: the classification module 1306 obtains, according to the calculated similarity Sim(d1, d2), the K short texts most similar to the question short text d1, obtains the information of the experts who answered these K short texts, determines which expert answered the most of them, sends the question short text d1 to that expert's page, and finishes.
Alternatively, step 306 can be realized as follows: extract the short text d2 with the maximum similarity to the question short text d1, and send the question short text d1 to the page of the expert who answered d2.
In another embodiment, the present invention can also be used for short-text clustering, for example, clustering search results or clustering answers in a question answering system (see Chinese patent application #200510130777.0, a reading and browsing display method and system for question answers).
Clustering gathers semantically close texts together as one class. In a question answering system, one search request submitted by a user generally returns many search results; clustering the search results by semantics makes them convenient for the user to browse. Likewise, answers can also be clustered.
For a given short text set (such as the questions or answers in a question answering system), the clustering flow is as follows:
(1) take each short text in the given set in turn as the target short text, and cyclically execute steps 201-205 to calculate the similarity between any two texts in the set;
(2) cluster all short texts according to the calculated similarity data. In one embodiment, the prior-art K-means clustering algorithm (http://en.wikipedia.org/wiki/K-means_clustering) can be adopted, as can other clustering methods such as Fuzzy C-means or hierarchical clustering (http://en.wikipedia.org/wiki/Cluster_Analysis#Fuzzy_c-means_clustering).
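The clustering flow above can be sketched with a bare-bones K-means over the short-text vectors; the function name and the Euclidean assignment are assumptions for illustration, the cited prior-art K-means being the intended method.

```python
# Plain K-means: alternate nearest-center assignment and center update.
import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):        # assign to nearest center
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centers[c])),
            )
        for c in range(k):                     # recompute cluster centers
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

When only pairwise similarities (rather than vectors) are available, medoid-based or hierarchical variants are the natural fit.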
In another embodiment, the present invention is applied in the text search field.
With reference to Fig. 4A, a structural schematic of the search engine server, and Fig. 4B, a flow chart of short-text retrieval based on probability topics.
The search engine server 40 comprises a communication connection interface 41, a database 42, a data processing apparatus 43, a display module 44, and an input module 45.
The communication connection interface 41 receives search words from a plurality of terminals 20. The data processing apparatus 43 analyzes the search words against the short-text data in the database 42. The input module 45 is used to input parameters or supervisory instructions. The display module 44 displays the analyzed short-text data.
The data processing apparatus of the search engine server is provided with the system of the present invention for classifying short texts based on probability topics.
Step 401: a search user inputs through a terminal a retrieval short text d1 (i.e., the aforementioned search word) to the search engine server; the search engine scans the database, randomly obtains a short text d2 from the database, and initializes the text vectors of the two short texts d1, d2 respectively.
Their text vectors V(1), V(2) are initialized respectively as:
V(1) = {v1(1), v2(1), ..., vN(1)}
V(2) = {v1(2), v2(2), ..., vN(2)},
where vi(j) is the weight of word wi of the dictionary in short text dj; the dictionary comprises all words in the database's short texts; vi(j) is calculated with the TF-IDF method.
Steps 402-405 correspond to steps 202-205 and are not detailed here; when step 401 is executed cyclically, another short text is randomly obtained from the database and initialized together with the retrieval short text d1. Step 406: after all short texts in the database have been traversed, the short texts in the database 42 whose similarity exceeds a predetermined value are extracted as retrieval results, sorted by similarity, and sent to the user terminal for display.
The retrieval method of the invention is based on probability topics; by judging whether the similarity reaches the threshold, it judges relevance in terms of content, greatly improving the accuracy of the results returned by short-text search.
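Step 406 above reduces to a threshold-and-sort over the similarities computed in steps 402-405; a sketch with a hypothetical `retrieve` helper and pair format:

```python
# Keep database short texts whose similarity to the query exceeds the
# predetermined value, returned best-first.
def retrieve(scored_texts, threshold):
    """scored_texts: list of (similarity, text) pairs, one per database entry."""
    hits = [(s, t) for s, t in scored_texts if s > threshold]
    return [t for s, t in sorted(hits, key=lambda p: p[0], reverse=True)]
```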
In another embodiment, the present invention is applied in a mail server to filter spam.
The data processing apparatus of the mail server is provided with the system of the present invention for classifying short texts based on probability topics.
With reference to Fig. 5A, a structural schematic of the mail server, and Fig. 5B, a flow chart of recognizing spam based on probability topics.
The mail server 50 comprises a communication connection interface 51, an email storage database 52, a data processing apparatus 53, a display module 54, and an input module 55.
The communication connection interface 51 transmits and receives data with a plurality of terminals 20. Newly received mail is stored in the email storage database 52. The data processing apparatus 53 analyzes the mail short texts in the email storage database 52. The input module 55 is used to input parameters or supervisory instructions. The display module 54 displays the analyzed mail short-text data.
The email storage database 52 comprises a model storage module storing some spam-title short-text models, and an email storage module storing newly received mail.
Step 501: the content of the mail server's email storage database is scanned; a model short text d1 is obtained from the model storage module, a mail-title short text d2 is obtained randomly or in order from the email storage module, and the text vectors of the two short texts d1, d2 are initialized respectively.
Their text vectors V(1), V(2) are initialized respectively as:
V(1) = {v1(1), v2(1), ..., vN(1)}
V(2) = {v1(2), v2(2), ..., vN(2)},
where vi(j) is the weight of word wi of the dictionary in short text dj; the dictionary comprises all words in the mail titles of the email storage database; vi(j) is calculated with the TF-IDF method.
Steps 502-505 correspond to steps 202-205.
When step 505 loops back to step 501, another mail-title short text is obtained randomly or in order from the email storage module and the steps are executed with the model short text d1; after all short texts in the email storage module have been traversed, the next model short text is obtained from the model storage module and the above steps are repeated.
Step 506: after all short texts and model short texts have been traversed, the mail-title short texts whose similarity with a short text in the model storage module exceeds a predetermined value are extracted and shielded or specially marked to show that they are spam; alternatively, a shielding operation is performed according to the identified spam.
The special marking includes concentrated display, highlighting, or marking in a particular color.
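The flagging decision of step 506 reduces to a threshold test against the best-matching spam model; a sketch with hypothetical names:

```python
# Flag a mail title as spam when its highest similarity to any spam
# model exceeds the predetermined value.
def flag_spam(titles, model_sims, threshold):
    """titles: mail-title strings; model_sims: {title: list of
    similarities to each spam model}; returns {title: is_spam}."""
    return {t: max(model_sims[t], default=0.0) > threshold for t in titles}
```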
The spam recognition method of the invention is carried out based on probability topics; by judging the similarity it judges relevance in terms of content, and therefore also greatly improves the accuracy of the returned results.
The present invention can also be used for shielding junk and advertising messages on mobile terminals.
Fig. 6A is a structural schematic of an application embodiment of the present invention.
A mobile communication system comprises a mobile terminal 20, for example a mobile phone, connected through a firewall 61 with a short message server 60. The system of the present invention for classifying short texts based on probability topics is arranged in this firewall, and is used to intercept and screen the short messages sent from the short message server 60 to the mobile terminal 20. In particular, the administrator saves typical models of the short messages to be intercepted in the model storage module of this firewall.
Fig. 6B is a schematic flow chart of this application embodiment.
Step 601: the firewall 61 receives the current short message text d1 sent from the short message server 60, scans the short message models stored in the model storage module, obtains a model short text d2 from the model storage module, and initializes the text vectors of the two short texts d1, d2 respectively.
Their text vectors V(1), V(2) are initialized respectively as:
V(1) = {v1(1), v2(1), ..., vN(1)}
V(2) = {v1(2), v2(2), ..., vN(2)},
where vi(j) is the weight of word wi of the dictionary in short text dj; the dictionary comprises all words of a previously constructed short message set, which includes both normal messages and junk messages; vi(j) is calculated with the TF-IDF method.
Steps 602-605 correspond to steps 202-205.
Once the similarity between the current short message text d1 and any model short text is found to be higher than a predetermined value, the short message text d1 is intercepted and not sent to the mobile terminal 20.
If the similarity between the current short message text d1 and all the model short texts is never found to be higher than the predetermined value, the current short message text is sent to the mobile terminal 20.
With the above method, short messages that are identical in form to, or related in meaning to, a model short text can be strictly intercepted, making the screening of junk messages stricter.
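The interception rule of the two preceding paragraphs can be sketched as follows (names are illustrative): intercept as soon as one model short text matches, otherwise deliver.

```python
# Decide whether the firewall delivers or intercepts an incoming message.
def firewall_decision(model_similarities, threshold):
    """model_similarities: similarities of the incoming short message to
    each model short text. Returns "intercept" or "deliver"."""
    for s in model_similarities:
        if s > threshold:          # one matching model suffices
            return "intercept"
    return "deliver"
```

Short-circuiting on the first match mirrors the described flow, where traversal of the remaining models is unnecessary once interception is decided.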
Alternatively, the firewall 61 can also be connected with an Internet web server 62; the user of the mobile terminal 20 can log in to this web server 62 and bind the number of the mobile terminal 20. Through the web server 62, the user can set the model short texts in the firewall 61, that is, personalize the types of short messages the mobile terminal 20 needs to intercept, so that the interception and release of short messages better satisfy the user's demand.
In one embodiment of the invention, the method and system can be applied to question classification in a question answering system. A question is a special kind of short text. Most existing research works on question-type classification (e.g., X. Li, D. Roth: Learning Question Classifiers, in Proceedings of the 19th International Conference on Computational Linguistics (2002); D. Zhang, W. S. Lee: Question Classification using Support Vector Machines, in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval (2003)), that is, classifying questions by their potential answer type, such as person, place, etc. Many online systems (such as interactive question answering systems), however, need to organize massive numbers of questions by topic, i.e., assign each question to a corresponding category according to its content, such as computers, education, sports, etc. To realize question classification, many classifiers (for example the k-nearest-neighbor classifier, KNN) need to compute the similarity between questions, and the probability-topic-based short-text similarity calculation method proposed by the present invention can be combined with such classic machine learning methods to classify questions. The method of the invention can also be verified by classification performance.
In one embodiment of the invention, two data sets were collected for testing. One data set, from BuyAns, comprises 1120 questions in 32 categories, with the number of questions per category ranging from 14 to 108; the Answers data set comprises 2400 questions in 11 categories, each category containing at least 100 and at most 400 questions. After removing the stop words in the questions, all words were converted to their base form (for English words, e.g., removing past-tense and similar inflections). Each question contains fewer than 10 English word stems. In each data set, 70% of the questions, randomly drawn, served as the training set, and the rest were used for testing. The test was repeated 10 times and the average results were recorded. Because the Gibbs sampling algorithm can usually reach the target distribution after 300 iterations, the number of iterations was set to 300. α and β were empirically set to 50/Z and 0.01 respectively.
In one embodiment of the invention, the performance of the following three classifiers was compared: (1) KNN; (2) SVM; (3) KNN combined with the probability-topic-based short-text similarity calculation method of the present invention (KNN_TBS). KNN and SVM are based on the TF-IDF vector representation and therefore serve as comparisons for the method of the invention. In KNN and KNN_TBS, the number of neighbors was set to 30. SVM was based on the libSVM tool (C. Chang and C. Lin, LIBSVM: a library for support vector machines, 2001, software available at http://www.csie.ntu.edu.tw/cjlin/libsvm/); the linear kernel of SVM was adopted in the experiments. The basic measures of precision, recall and F1 were used to evaluate how the above three classifiers classify questions, and the micro-averages and macro-averages of precision, recall and F1 were calculated.
In the method of the invention, the number of extracted topics is an important parameter, which can be optimized according to classification performance. Figs. 7A and 7B show the micro-average (MicroF1) and macro-average (MacroF1) of F1 obtained with different topic numbers on the two data sets. On the BuyAns and Answer data sets, classification performance was highest when the topic number was 160 and 120 respectively.
Likewise, λ is also an important parameter: it determines whether two differentiating word sets are related through a probability topic. If it is chosen too large, the true association between the differentiating word sets may be overlooked; if too small, an association between the differentiating word sets may be misjudged. Figs. 8A and 8B show the micro-average (MicroF1) and macro-average (MacroF1) of F1 obtained with different λ values on the two data sets.
Tables 1 and 2 show the test results of the two data sets respectively. The number of extracted topics for the BuyAns data set was set to 160 with λ set to 0.05; for the Answer data set, the number of extracted topics was set to 120 with λ set to 0.15. As can be seen from the tables, combined with the method of the present invention, classification accuracy on the BuyAns data set improved by 10% compared with the traditional TF-IDF method, and also improved in most cases compared with SVM. On the Answer data set, the micro-average and macro-average of F1 improved by 12% and 17% respectively compared with the traditional TF-IDF method, and also improved by 10% compared with SVM.
Table 1: question classification test results of the BuyAns data set
Table 2: question classification test results of the Answer data set
The above embodiments are only for explaining the present invention and are not to be considered as limiting it; the concrete scope of protection is defined by the appended claims.

Claims (17)

1. A method for classifying short texts based on probability topics, applied in the data processing apparatus of a question answering system and used to classify short texts according to the similarity between short texts, characterized by comprising the steps of:
obtaining initialized text vectors respectively for an input target short text and for a short text obtained from the database of the question answering system;
scanning the two short texts to obtain the differentiating words of the two short texts respectively;
when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than a threshold, correcting the text vectors of the two short texts according to the correlation degrees;
calculating the similarity between the two short texts according to the corrected text vectors of the two short texts;
obtaining another short text from the database of the question answering system and executing the scanning step, until the short texts in the database have all been traversed;
classifying the target short text according to the similarity.
2. The method of claim 1, characterized in that the initialized text vectors obtained by scanning the two short texts are: V(1) = {v1(1), v2(1), ..., vN(1)}, V(2) = {v1(2), v2(2), ..., vN(2)},
where V(j) is the initialized text vector of text dj, vi(j) is the weight of the i-th word wi of the dictionary in short text dj, and the dictionary comprises all words of all short texts in the question answering system.
3. The method of claim 1 or 2, characterized in that the step of scanning the two short texts to obtain the differentiating words of the two short texts respectively further comprises:
comparing the two short texts to obtain the differentiating word set of each short text respectively, the differentiating word sets being:
Dist(d1) = {w | w ∈ d1, w ∉ d2}
Dist(d2) = {w | w ∈ d2, w ∉ d1}
where di is a short text, Dist(di) is the differentiating word set of short text di, and w is a word.
4. The method of claim 1, characterized by further comprising a step of extracting the probability topics by the Gibbs sampling method.
5. The method of claim 3, characterized in that the step of correcting the text vectors of the two short texts according to the correlation degrees, when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than the threshold, further comprises:
step 31: selecting a probability topic ti, and searching Dist(d1) and Dist(d2) respectively for the differentiating word with the maximum probability on probability topic ti;
step 32: judging whether the probabilities of the two selected differentiating words on probability topic ti are both greater than the threshold; if both are greater, executing step 33; otherwise, executing step 34;
step 33: correcting V(1), V(2) according to the following formulas:
vn(1) = vn(1) + vn(2) × P(wn | ti)
vm(2) = vm(2) + vm(1) × P(wm | ti),
where vn(1) is the n-th element of V(1), vn(2) is the n-th element of V(2), P(wm | ti) is the probability that word wm belongs to topic ti, wm is the word of short text d1 with the maximum probability on probability topic ti, and wn is the word of short text d2 with the maximum probability on probability topic ti;
step 34: judging whether any topic remains unselected; if so, executing step 31; if not, executing the step of calculating the similarity between the two short texts according to the corrected text vectors of the two short texts.
6. The method of claim 5, characterized in that the step of calculating the similarity between the two short texts according to their text vectors comprises:
Sim(d1, d2) = (V(1) · V(2)) / (|V(1)| |V(2)|),
where Sim(d1, d2) is the similarity of d1 and d2, and V(1), V(2) are the corrected text vectors of d1 and d2.
7. The method of claim 1, characterized in that the classification step is realized with a k-nearest-neighbor classifier or a centroid method.
8. A system for classifying short texts based on probability topics, applied in the data processing apparatus of a question answering system and used to classify short texts according to the similarity between short texts, characterized by comprising:
an initialization module, which obtains initialized text vectors respectively for an input target short text and for a short text obtained from the database of the question answering system;
a differentiating word identification module, which scans the two short texts and obtains the differentiating words of the two short texts respectively;
a judgment and correction module, which, when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than a threshold, corrects the text vectors of the two short texts according to the correlation degrees;
a computing module, which calculates the similarity between the two short texts according to the corrected text vectors of the two short texts;
a classification module, which classifies the target short text according to the similarity.
9. The system of claim 8, characterized in that the initialized text vectors obtained by the initialization module by scanning the two short texts are: V(1) = {v1(1), v2(1), ..., vN(1)}, V(2) = {v1(2), v2(2), ..., vN(2)},
where V(j) is the initialized text vector of text dj, vi(j) is the weight of the i-th word wi of the dictionary in short text dj, and the dictionary comprises all words of all short texts in the question answering system.
10. The system of claim 8 or 9, characterized in that the differentiating word identification module further comprises a comparison module and a differentiating word set acquisition module;
the comparison module is used to compare the two short texts;
the differentiating word set acquisition module is used to obtain the differentiating word set of each short text respectively, the differentiating word sets being:
Dist(d1) = {w | w ∈ d1, w ∉ d2}
Dist(d2) = {w | w ∈ d2, w ∉ d1}
where di is a short text, Dist(di) is the differentiating word set of short text di, and w is a word.
11. The system of claim 8, characterized by further comprising a probability topic extraction module, which extracts the probability topics by the Gibbs sampling method.
12. The system of claim 10, characterized in that the judgment and correction module further comprises:
a topic selection module, which selects a probability topic ti and searches Dist(d1) and Dist(d2) respectively for the word with the maximum probability on probability topic ti;
a judgment module, which judges whether the probabilities of the two selected differentiating words on probability topic ti are both greater than the threshold; if both are greater, the correction module is executed; otherwise, the loop module is executed;
a correction module, which corrects V(1), V(2) according to the following formulas:
vn(1) = vn(1) + vn(2) × P(wn | ti)
vm(2) = vm(2) + vm(1) × P(wm | ti),
where vn(1) is the n-th element of V(1), vn(2) is the n-th element of V(2), P(wm | ti) is the probability that word wm belongs to topic ti, wm is the word of short text d1 with the maximum probability on probability topic ti, and wn is the word of short text d2 with the maximum probability on probability topic ti;
a loop module, which judges whether any topic remains unselected; if so, the topic selection module is executed; if not, the computing module is executed.
13. The system of claim 12, characterized in that the computing module is realized in the following way:
Sim(d1, d2) = (V(1) · V(2)) / (|V(1)| |V(2)|),
where Sim(d1, d2) is the similarity of d1 and d2, and V(1), V(2) are the corrected text vectors of d1 and d2.
14. The system of claim 8, characterized in that the classification module is realized with a k-nearest-neighbor classifier or a centroid method.
15. A method for short-text retrieval based on probability topics, applied in a search engine server, characterized by comprising the steps of:
step 1: receiving a retrieval short text, serving as the search term, transmitted from a search user terminal;
step 2: obtaining a short text from the database of the search engine server;
step 3: obtaining initialized text vectors respectively for the retrieval short text and the short text obtained from the database;
step 4: scanning the two short texts to obtain the differentiating words of the two short texts respectively;
step 5: when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than a threshold, correcting the text vectors of the two short texts according to the correlation degrees;
step 6: calculating the similarity between the two short texts according to the corrected text vectors of the two short texts;
step 7: executing step 2 cyclically until the short texts in the database have been traversed;
step 8: extracting the short texts whose similarity with the retrieval short text exceeds a predetermined value, and sending them to the search user terminal for display.
16. A method for recognizing spam based on probability topics, applied in a mail server, characterized by comprising the steps of:
step 1: storing model spam short texts in the database of the mail server, and obtaining one of them;
step 2: obtaining a short text from the database of the mail server;
step 3: obtaining initialized text vectors respectively for the above two short texts;
step 4: scanning the two short texts to obtain the differentiating words of the two short texts respectively;
step 5: when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than a threshold, correcting the text vectors of the two short texts according to the correlation degrees;
step 6: calculating the similarity between the two short texts according to the corrected text vectors of the two short texts;
step 7: executing step 2 cyclically until all short texts in the database have been traversed, then executing step 1 again, until all model spam short texts have been traversed;
step 8: marking as spam the short texts obtained from the database of the mail server whose similarity with a model spam short text exceeds a predetermined value.
17. A probability-topic-based method for blocking spam and advertising messages on a mobile terminal, applied in a firewall of a mobile communication system, wherein the firewall stores model short texts and the mobile communication system comprises a short message server, the firewall, and user terminals, characterized by comprising the steps of:
Step 1: the firewall receives a short message text sent by the short message server;
Step 2: the firewall obtains a model short text;
Step 3: obtain initialized text vectors for the short message text and the model short text respectively;
Step 4: scan the two short texts to obtain the differentiating words of each;
Step 5: when the correlation degrees between the differentiating words of the two short texts and a probability topic are both higher than a threshold, modify the text vectors of the two short texts according to those correlation degrees;
Step 6: calculate the similarity between the two short texts according to the modified text vectors;
Step 7: repeat from Step 2 until all model short texts in the firewall have been traversed;
Step 8: the firewall intercepts the short message text if its similarity to any model short text is higher than a predetermined value.
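The firewall loop of claim 17 (steps 2, 7, and 8) can be sketched as below. This is a simplified illustration, not the claimed implementation: it uses a bag-of-words cosine similarity in place of the full topic-adjusted vectors of steps 3-6, and the function names and threshold value are hypothetical.

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity between two short texts as bags of words."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def firewall_filter(message, model_texts, threshold=0.8):
    """Step 7's traversal: compare the incoming short message text against
    every model short text stored in the firewall. Per step 8, intercept
    (return False) as soon as any similarity exceeds the predetermined
    threshold; otherwise deliver to the user terminal (return True)."""
    for model in model_texts:
        if cosine_sim(message, model) > threshold:
            return False  # intercepted: spam/advertising message
    return True  # delivered
```

A message matching a stored model text is intercepted; an unrelated message passes through.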
CN200910090377A 2009-08-07 2009-08-07 Method and system for classifying short texts based on probability topic Pending CN101621391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910090377A CN101621391A (en) 2009-08-07 2009-08-07 Method and system for classifying short texts based on probability topic

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910090377A CN101621391A (en) 2009-08-07 2009-08-07 Method and system for classifying short texts based on probability topic

Publications (1)

Publication Number Publication Date
CN101621391A true CN101621391A (en) 2010-01-06

Family

ID=41514455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910090377A Pending CN101621391A (en) 2009-08-07 2009-08-07 Method and system for classifying short texts based on probability topic

Country Status (1)

Country Link
CN (1) CN101621391A (en)


Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102073704A (en) * 2010-12-24 2011-05-25 华为终端有限公司 Text classification processing method, system and equipment
CN102073704B (en) * 2010-12-24 2013-09-25 华为终端有限公司 Text classification processing method, system and equipment
CN102681983A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Alignment method and device for text data
CN103457975A (en) * 2012-06-01 2013-12-18 腾讯科技(深圳)有限公司 Method and device for acquiring map interest point evaluation data
CN103457975B (en) * 2012-06-01 2016-08-31 腾讯科技(深圳)有限公司 The method and apparatus obtaining map interest point evaluation data
CN103823809A (en) * 2012-11-16 2014-05-28 百度在线网络技术(北京)有限公司 Query phrase classification method and device, and classification optimization method and device
CN103823809B (en) * 2012-11-16 2018-06-08 百度在线网络技术(北京)有限公司 A kind of method, the method for Classified optimization and its device to query phrase classification
CN103049433A (en) * 2012-12-11 2013-04-17 微梦创科网络科技(中国)有限公司 Automatic question answering method, automatic question answering system and method for constructing question answering case base
CN103049433B (en) * 2012-12-11 2015-10-28 微梦创科网络科技(中国)有限公司 The method of automatic question-answering method, automatically request-answering system and structure question and answer case library
CN103874033A (en) * 2012-12-12 2014-06-18 上海粱江通信系统股份有限公司 Method for identifying irregular spam short message on the basis of Chinese word segmentation
CN103874033B (en) * 2012-12-12 2017-11-24 上海粱江通信系统股份有限公司 A kind of method that irregular refuse messages are identified based on Chinese word segmentation
CN104461224A (en) * 2013-09-23 2015-03-25 联想(北京)有限公司 Information processing method and electronic device
CN105005564A (en) * 2014-04-17 2015-10-28 北京搜狗科技发展有限公司 Data processing method and apparatus based on question-and-answer platform
CN105005564B (en) * 2014-04-17 2019-09-03 北京搜狗科技发展有限公司 A kind of data processing method and device based on answer platform
US10250550B2 (en) 2014-04-28 2019-04-02 Huawei Technologies Co., Ltd. Social message monitoring method and apparatus
WO2015165230A1 (en) * 2014-04-28 2015-11-05 华为技术有限公司 Social contact message monitoring method and device
CN106156142A (en) * 2015-04-13 2016-11-23 深圳市腾讯计算机系统有限公司 The processing method of a kind of text cluster, server and system
CN106156142B (en) * 2015-04-13 2020-06-02 深圳市腾讯计算机系统有限公司 Text clustering processing method, server and system
CN107533574A (en) * 2015-09-03 2018-01-02 华为技术有限公司 Email relationship finger system based on random index pattern match
US10936638B2 (en) 2015-09-03 2021-03-02 Huawei Technologies Co., Ltd. Random index pattern matching based email relations finder system
CN106649255A (en) * 2015-11-04 2017-05-10 江苏引跑网络科技有限公司 Method for automatically classifying and identifying subject terms of short texts
CN107315731A (en) * 2016-04-27 2017-11-03 北京京东尚科信息技术有限公司 Text similarity computing method
CN106095845A (en) * 2016-06-02 2016-11-09 腾讯科技(深圳)有限公司 File classification method and device
CN106782516B (en) * 2016-11-17 2020-02-07 北京云知声信息技术有限公司 Corpus classification method and apparatus
CN106782516A (en) * 2016-11-17 2017-05-31 北京云知声信息技术有限公司 Language material sorting technique and device
CN110019794B (en) * 2017-11-07 2023-04-25 腾讯科技(北京)有限公司 Text resource classification method and device, storage medium and electronic device
CN110019794A (en) * 2017-11-07 2019-07-16 腾讯科技(北京)有限公司 Classification method, device, storage medium and the electronic device of textual resources
CN108345424A (en) * 2018-01-31 2018-07-31 维沃移动通信有限公司 A kind of method for information display and mobile terminal
CN110619117B (en) * 2018-06-19 2024-03-19 阿里巴巴(中国)有限公司 Keyword extraction method and device
CN110619117A (en) * 2018-06-19 2019-12-27 广州优视网络科技有限公司 Keyword extraction method and device
CN109063032B (en) * 2018-07-16 2020-09-11 清华大学 Noise reduction method for remote supervision and retrieval data
CN109063032A (en) * 2018-07-16 2018-12-21 清华大学 A kind of noise-reduction method of remote supervisory retrieval data
CN109783127B (en) * 2018-11-26 2022-05-24 中国船舶重工集团公司第七0九研究所 Demand configuration problem solving method and device for service information system
CN109783127A (en) * 2018-11-26 2019-05-21 中国船舶重工集团公司第七0九研究所 A kind of the demand allocation problem method for solving and device of serviceization information system
WO2020132933A1 (en) * 2018-12-25 2020-07-02 深圳市优必选科技有限公司 Short text filtering method and apparatus, medium and computer device
CN110069772B (en) * 2019-03-12 2023-10-20 平安科技(深圳)有限公司 Device, method and storage medium for predicting scoring of question-answer content
CN110069772A (en) * 2019-03-12 2019-07-30 平安科技(深圳)有限公司 Predict device, method and the storage medium of the scoring of question and answer content
CN110413753B (en) * 2019-07-22 2020-09-22 阿里巴巴集团控股有限公司 Question-answer sample expansion method and device
US11100412B2 (en) 2019-07-22 2021-08-24 Advanced New Technologies Co., Ltd. Extending question and answer samples
CN110413753A (en) * 2019-07-22 2019-11-05 阿里巴巴集团控股有限公司 The extended method and device of question and answer sample
CN111125334B (en) * 2019-12-20 2023-09-12 神思电子技术股份有限公司 Search question-answering system based on pre-training
CN111125334A (en) * 2019-12-20 2020-05-08 神思电子技术股份有限公司 Search question-answering system based on pre-training
CN112988954A (en) * 2021-05-17 2021-06-18 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN101621391A (en) Method and system for classifying short texts based on probability topic
Mallick et al. Digital media news categorization using Bernoulli document model for web content convergence
CN109885773B (en) Personalized article recommendation method, system, medium and equipment
Ghanbari-Adivi et al. Text emotion detection in social networks using a novel ensemble classifier based on Parzen Tree Estimator (TPE)
US7689531B1 (en) Automatic charset detection using support vector machines with charset grouping
KR102069621B1 (en) Apparatus and Method for Documents Classification Using Documents Organization and Deep Learning
US7711673B1 (en) Automatic charset detection using SIM algorithm with charset grouping
Lee et al. Engineering doc2vec for automatic classification of product descriptions on O2O applications
Gupta et al. A novel hybrid text summarization system for Punjabi text
CN111797239A (en) Application program classification method and device and terminal equipment
CN104361037A (en) Microblog classifying method and device
CN101211368B (en) Method for classifying search term, device and search engine system
Fu et al. Open-categorical text classification based on multi-LDA models
Shrivastava et al. Enhancing aggression detection using GPT-2 based data balancing technique
Puteh et al. Sentiment mining of Malay newspaper (SAMNews) using artificial immune system
Escalante et al. Particle swarm model selection for authorship verification
Gupta et al. Fake news detection using machine learning
Al Mostakim et al. Bangla content categorization using text based supervised learning methods
Mani et al. Email spam detection using gated recurrent neural network
Illig et al. A comparison of content-based tag recommendations in folksonomy systems
CN112487263A (en) Information processing method, system, equipment and computer readable storage medium
Moumtzidou et al. Discovery of environmental nodes in the web
Rahman et al. An efficient deep learning technique for bangla fake news detection
CN115510269A (en) Video recommendation method, device, equipment and storage medium
Reddy et al. Classification of Spam Text using SVM

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 2010-01-06