CN107315731A - Text similarity computing method - Google Patents

Text similarity computing method

Info

Publication number
CN107315731A
Authority
CN
China
Prior art keywords
text
phrase
computing method
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610268995.9A
Other languages
Chinese (zh)
Inventor
俞晓光
陶玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201610268995.9A
Publication of CN107315731A
Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text similarity calculation method, comprising: step (S1), creating, according to preset classification themes divided based on user intention and from historical text, an intent recognition classification model for the phrases in the historical text, the intent recognition classification model reflecting the probability of each phrase under each classification theme; step (S2), segmenting an object text that is the object of the similarity calculation into object phrases corresponding to the phrases in the intent recognition classification model, and, based on the intent recognition classification model, summing and normalizing the probabilities of the object phrases to obtain an intent classification vector of the object text, the intent classification vector reflecting the probability of the object text under each classification theme; and step (S3), calculating the similarity of two object texts from their intent classification vectors using the cosine method.

Description

Text similarity calculation method
Technical field
The present invention relates to a text similarity calculation method, and more particularly to a text similarity calculation method that uses an intent recognition classification model.
Background art
Text similarity refers to algorithms that determine whether two texts (for example, two questions) are similar. As one of the most basic algorithms it is widely used, and it lies at the core of a series of problems such as search engines, text ranking and related-question mining. If the pairwise similarity between texts can be calculated effectively, this series of problems can be solved as well.
Intent recognition means recognizing the intention behind a behavior. For example, in a question-and-answer dialog, every sentence of the questioner carries a certain intention, and the answerer replies according to that intention. Intent recognition is widely used in scenarios such as related-question search engines and chatbots. In a chatbot in particular, intent recognition is the core module of the whole system. When answering users' questions, all questions are divided in advance into classification themes, one theme per user intention (taking dialogs between a company's customer service and users as an example, a theme is a service point, for example returns and exchanges, or delivery address). Each time the user asks a question, the question is mapped to one of the themes, and the answer corresponding to that theme is then given.
Machine learning is a science of artificial intelligence. The main research subject of the field is artificial intelligence, in particular how to improve the performance of specific algorithms through learning from experience. Common machine learning methods can be divided into supervised learning, semi-supervised learning and unsupervised learning.
In so-called supervised learning, a function is learned from a given training data set, and when new data arrive, a result can be predicted according to that function. The training set of supervised learning is required to contain inputs and outputs, which may also be called features and targets. The targets in the training set can be labeled in advance.
A so-called topic model is a method of modeling the latent topics of texts. Given a training corpus, it automatically divides the corpus into different topics and is used to predict which topic a new piece of text belongs to.
LR (logistic regression) is a commonly used supervised learning algorithm.
Bag of words is a document representation method.
For example, given the dictionary:
{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
and the text:
John likes to watch movies. Mary likes too.
the text can be converted, according to the dictionary, into the following vector:
[1, 1, 1, 1, 1, 0, 0, 0, 1, 1]
where 1 indicates that the corresponding word of the dictionary occurs in the text, and 0 indicates that it does not.
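Purely as an illustration (this code is not part of the original patent text), the binary bag-of-words vector above can be reproduced with a short sketch like the following; the function name and the tokenization rule are assumptions.

```python
import re

# The example dictionary above: word -> 1-based index in the vector.
dictionary = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

def bag_of_words(text, dictionary):
    """Binary bag of words: 1 if the dictionary word occurs in the text, else 0."""
    tokens = set(re.findall(r"[A-Za-z]+", text))
    ordered_words = sorted(dictionary, key=dictionary.get)
    return [1 if word in tokens else 0 for word in ordered_words]

print(bag_of_words("John likes to watch movies. Mary likes too.", dictionary))
# Prints [1, 1, 1, 1, 1, 0, 0, 0, 1, 1], matching the vector above.
```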
There are many existing methods for calculating text similarity, for example converting the texts into word vectors and then taking the cosine of the angle between the vectors, or a series of algorithms such as BM25 (BM stands for Best Matching, an optimal matching criterion) and LCS (Longest Common Subsequence).
However, existing algorithms for calculating text similarity can often only reflect the similarity of texts in certain respects, and the algorithms are essentially strongly tied to the literal wording of the texts. On the one hand, when two texts match on a core word or merely match on a common stop word, the similarity given by such an algorithm is the same and the two cases cannot be distinguished; on the other hand, if two texts contain synonyms, the similarity comes out very low because the wording differs, even though they express the same meaning. As for general topic models, since each topic is generated by automatic clustering, on the one hand the generated topics are often not humanly interpretable, and on the other hand some unrelated questions are grouped into the same topic, so the effect rarely meets expectations.
In addition, in actual use multiple similarity algorithms often need to be fused, and even then the effect is difficult to make satisfactory.
Summary of the invention
In view of the above problems of the prior art, namely that it essentially always has a strong correlation with the literal wording of the text and cannot really judge text similarity from the semantic level of the text, the object of the present invention is to provide a text similarity calculation method that completely avoids the drawback in the prior art of calculating similarity according to the literal wording, and that has higher accuracy and a better effect.
The text similarity calculation method of one aspect of the present invention comprises: step (S1), creating, according to preset classification themes divided based on user intention and from historical text, an intent recognition classification model for the phrases in the historical text, the intent recognition classification model reflecting the probability of each phrase under each classification theme; step (S2), segmenting an object text that is the object of the similarity calculation into object phrases corresponding to the phrases in the intent recognition classification model, and, based on the intent recognition classification model, summing and normalizing the probabilities of the object phrases to obtain an intent classification vector of the object text, the intent classification vector reflecting the probability of the object text under each classification theme; and step (S3), calculating the similarity of two object texts from their intent classification vectors using the cosine method.
In the text similarity calculation method according to one aspect of the present invention, the formula of the cosine method is:
\cos\theta = \frac{\sum_{i=1}^{n}(A_i \times B_i)}{\sqrt{\sum_{i=1}^{n}(A_i)^2} \times \sqrt{\sum_{i=1}^{n}(B_i)^2}}
where cos θ denotes the similarity, i is the index of the classification theme of the intent classification vector and takes positive integer values from 1 to n, A denotes the first object text, B denotes the second object text, and A_i and B_i denote the probability of the first object text and of the second object text, respectively, under the current classification theme.
In the text similarity calculation method according to one aspect of the present invention, the intent recognition classification model is created by the bag-of-words method combined with the logistic regression algorithm.
In the text similarity calculation method according to one aspect of the present invention, the classification themes are service points of dialogs between customer service and users.
In the text similarity calculation method according to one aspect of the present invention, the historical text is text in historical consultation logs of dialogs between customer service and users.
In the text similarity calculation method according to one aspect of the present invention, the phrases are a subset of phrases filtered out of the historical text as needed.
In the text similarity calculation method according to one aspect of the present invention, the number of classification themes is the dimension of the intent classification vector.
In the text similarity calculation method according to one aspect of the present invention, the probabilities are the component values of the intent classification vector.
In summary, the above technical solution of the text similarity calculation method of the present invention realizes a text similarity calculation method with higher accuracy and a better effect, and avoids the drawback in the prior art of calculating similarity entirely according to the literal wording.
Brief description of the drawings
Fig. 1 is the general block diagram of the text similarity calculation method of the present invention.
Fig. 2 is a flowchart of step S1 of the text similarity calculation method of the present invention, in which the intent recognition classification model is created.
Fig. 3 is a flowchart of step S2 of the text similarity calculation method of the present invention, in which the intent classification vector of the object text is obtained.
Embodiment
The present invention is a text similarity calculation method that makes use of an intent recognition classification model. According to classification themes that have been divided in advance, the intent recognition classification model can map a text onto the corresponding classification themes and thereby obtain information about its semantic level. Text similarity is then calculated on this basis, so a better effect can be obtained.
In order to make the object, technical solution and advantages of the present invention clearer, the present invention is described in detail below in conjunction with specific embodiments and with reference to the drawings.
Fig. 1 is the general block diagram of the text similarity calculation method of the present invention. As shown in Fig. 1, the text similarity calculation method includes: step S1 of creating the intent recognition classification model; step S2 of obtaining the intent classification vector of the object text; and step S3 of calculating the similarity.
Fig. 2 is a flowchart of step S1 of the text similarity calculation method of the present invention, in which the intent recognition classification model is created.
As shown in Fig. 2 in the step S1 for creating intention assessment disaggregated model, first, setting in advance Surely the classification scheme (step S1-1) classified by the intention of user.With company's customer service and user couple Exemplified by words, a classification scheme is exactly a service point, and each problem (text) of user can be with Corresponding service point correspondence in these service points.For example, it is assumed herein that being divided into 3 kinds of classification schemes: " relevant freight charges ", " relevant goods return and replacement ", " relevant delivery address ".
Then, obtain history text and (by taking company's customer service and user session as an example, then seek advice from day for history Text in will), and history text is subjected to cutting word, to determine modeling phrase (step S1-2). That is, by taking above-mentioned Bag of words (bag of words) method as an example, it can be cut into and the dictionary number in bag of words According to corresponding phrase one by one, modeling phrase is used as.Here, can not be all conducts of all phrases Modeling phrase, but actually useful a part of phrase can be filtered out as needed as modeling Use phrase.
Then, for each identified modeling phrase, according to above-mentioned default classification scheme, profit With known algorithm (for example, using Bag of words (bag of words) method, every text is converted to Vector, is that logistic regression algorithm carries out model training using LR (Logistic regression) then), Create the intention assessment disaggregated model (step S1-3) for each phrase.
Here, the output of intention assessment disaggregated model is a vector (also referred to as theme vector), to The classification scheme number of the dimension of amount and above-mentioned division is consistent (in this example, being " 3 "), per one-dimensional Numerical value represent text or phrase belongs to the probability of corresponding classification scheme, probability is bigger to represent text This or phrase are more likely to belong to current class theme, and vectorial all dimensions add up to 1.
It is following【Table 1】, there is shown one of the intention assessment disaggregated model for phrase created shows Example.(here, table 1 indicates an example, numerical value not actual numerical value.Moreover, the intention assessment Disaggregated model is a kind of existing machine learning algorithm, more than one, different its algorithm logic of algorithm It is different)
【Table 1】
Phrase          Freight    Returns and exchanges    Delivery address
item            0.33       0.33                     0.33
shipping        0.45       0.10                     0.45
free shipping   0.80       0.10                     0.10
where           0.15       0.05                     0.80
freight         0.80       0.10                     0.10
...             ...        ...                      ...
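The patent gives no code for this step; the following is a non-authoritative sketch, assuming scikit-learn's CountVectorizer and LogisticRegression as the "bag of words plus LR" combination named above, with a tiny hypothetical training set. A row of Table 1 then corresponds to the predicted theme probabilities of a single modeling phrase.

```python
# A minimal sketch, assuming scikit-learn; not the patent's actual implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-segmented historical texts (space-joined phrases) and their
# preset classification themes: 0 = freight, 1 = returns/exchanges, 2 = delivery address.
texts = ["freight who pays", "item free-shipping", "return the item",
         "exchange the goods", "where is it shipped from", "change delivery address"]
labels = [0, 0, 1, 1, 2, 2]

vectorizer = CountVectorizer(token_pattern=r"[^ ]+")   # bag-of-words over phrases
X = vectorizer.fit_transform(texts)
model = LogisticRegression(max_iter=1000).fit(X, labels)

def phrase_probabilities(phrase):
    """Probability of a single modeling phrase under each theme (one row of Table 1)."""
    return model.predict_proba(vectorizer.transform([phrase]))[0]

print(phrase_probabilities("freight"))   # e.g. roughly [0.8, 0.1, 0.1] for a freight-related phrase
```

Any classifier that outputs one probability per preset classification theme could play the same role here; bag of words plus logistic regression is simply the combination the description names as an example.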
Fig. 3 is a flowchart of step S2 of the text similarity calculation method of the present invention, in which the intent classification vector of the object text is obtained.
As shown in Fig. 3, in step S2 of obtaining the intent classification vector of the object text, first, an object text that is an object of the similarity assessment is obtained (step S2-1).
Then, the intent recognition vector of the object text is obtained using the intent recognition classification model created above (step S2-2). Specifically, the input of the intent recognition classification model is the object text, and its output is a vector (also called a theme vector). The dimension of the vector equals the number of classification themes divided above (in this example, 3), the value of each dimension represents the probability that the text or phrase belongs to the corresponding classification theme, a larger probability means that the text or phrase is more likely to belong to that classification theme, and all dimensions of the vector sum to 1.
For example, assume that the object text is "who pays the freight for shipping the item". Word segmentation according to the bag-of-words method gives the phrases "item", "shipping", "freight" and "who pays". Then, according to the intent recognition classification model for phrases in 【Table 1】 above, the intent recognition vector of the object text, i.e. the probability that the object text belongs to each corresponding classification theme, is obtained by summing and normalizing. For example, the specific calculation (the sum-and-normalize algorithm) is as follows.
In the first step, the probability that the text belongs to each classification theme is calculated:
probability of belonging to classification theme 1 (e.g. "freight"): P1 = 0.33 + 0.45 + 0.80;
probability of belonging to classification theme 2 (e.g. "returns and exchanges"): P2 = 0.33 + 0.10 + 0.10;
·····
probability of belonging to classification theme n: Pn = xxx + xxx + xxx.
In the second step, each probability is normalized:
final probability of belonging to classification theme 1 = P1 / (P1 + P2 + ... + Pn);
final probability of belonging to classification theme 2 = P2 / (P1 + P2 + ... + Pn);
·····
final probability of belonging to classification theme n = Pn / (P1 + P2 + ... + Pn).
Here, this is again only an example and the values are not actual values. Moreover, this is not the only possible algorithm.
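As an illustrative sketch only (the per-phrase rows are the example values from Table 1 above; the code itself is assumed and not taken from the patent), the sum-and-normalize step could look like this:

```python
# Example per-phrase probabilities from Table 1 above
# (themes: freight, returns and exchanges, delivery address).
phrase_model = {
    "item":          [0.33, 0.33, 0.33],
    "shipping":      [0.45, 0.10, 0.45],
    "free shipping": [0.80, 0.10, 0.10],
    "where":         [0.15, 0.05, 0.80],
    "freight":       [0.80, 0.10, 0.10],
}

def intent_vector(object_phrases, phrase_model, n_themes=3):
    """Sum the per-theme probabilities of the known phrases, then normalize to 1."""
    sums = [0.0] * n_themes
    for phrase in object_phrases:
        for i, p in enumerate(phrase_model.get(phrase, [0.0] * n_themes)):
            sums[i] += p                                 # first step: per-theme sums P1..Pn
    total = sum(sums)
    return [s / total for s in sums] if total else sums  # second step: normalize

# Segmentation of the example object text "who pays the freight for shipping the item";
# "who pays" is not in the model and therefore contributes nothing.
print(intent_vector(["item", "shipping", "freight", "who pays"], phrase_model))
```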
Then, it is judged whether all object texts to be assessed for similarity have been obtained. If not ("No"), the process returns to step S2-1 to obtain the next object text; if so ("Yes"), the process proceeds to step S3.
The following 【Table 2】 shows an example of the intent classification vectors of the obtained object texts. (Here, Table 2 is also only an example; the values are not actual values and do not fully match 【Table 1】 above.)
【Table 2】
In step S3 of calculating the similarity, the similarity of two texts is calculated according to the following cosine formula (formula 1):
\cos\theta = \frac{\sum_{i=1}^{n}(A_i \times B_i)}{\sqrt{\sum_{i=1}^{n}(A_i)^2} \times \sqrt{\sum_{i=1}^{n}(B_i)^2}}    (formula 1)
where cos θ denotes the similarity, i is the dimension index of the vector, i.e. the index of the classification theme, and takes positive integer values from 1 to n (in this example, n = 3), A denotes the first object text, B denotes the second object text, and A_i and B_i denote the vector values, i.e. the probabilities, of the first object text and the second object text, respectively, under the current classification theme.
Thus, according to 【Table 2】 above, the similarity between "who pays the freight for shipping the item" and "does the product have free shipping" obtained by the above formula is 0.9967, while the similarity between "who pays the freight for shipping the item" and "where is the item shipped from" is 0.0819.
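A minimal sketch of step S3 follows, computing the cosine similarity of two intent classification vectors according to formula 1 (illustrative only; the vectors below are made-up placeholders, since the values of Table 2 are not reproduced here).

```python
import math

def cosine_similarity(a, b):
    """Formula 1: sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical intent classification vectors of two object texts.
vector_a = [0.53, 0.18, 0.29]
vector_b = [0.62, 0.15, 0.23]
print(cosine_similarity(vector_a, vector_b))
```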
It can be seen that when two texts express the same intention, the method reflects well that their similarity is high; conversely, when the intentions are far apart, the text similarity is also low. Moreover, the similarity has little to do with the literal wording: it is not the case that the closer the wording, the more similar the texts.
Thus, the similarity calculated and obtained by the above method of the present invention is an understanding at the level of text semantics and is at a higher level of abstraction than ordinary similarity calculation methods. It does not simply determine similarity by whether the literal wording of the texts is consistent, but judges, from the real intention of the texts, whether the two texts express the same meaning. Compared with ordinary literal similarity algorithms, it avoids the drawback mentioned above of calculating similarity entirely according to the literal wording. Compared with general topic models, the intent recognition classification model has a higher accuracy rate, so the effect is also better.
The specific embodiments described above further describe the object, technical solution and beneficial effects of the present invention in detail. It should be understood that the above is only a specific example of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (8)

1. A text similarity calculation method, comprising:
step (S1): creating, according to preset classification themes divided based on user intention and from historical text, an intent recognition classification model for the phrases in the historical text, the intent recognition classification model reflecting the probability of each phrase under each classification theme;
step (S2): segmenting an object text that is the object of the similarity calculation into object phrases corresponding to the phrases in the intent recognition classification model, and, based on the intent recognition classification model, summing and normalizing the probabilities of the object phrases to obtain an intent classification vector of the object text, the intent classification vector reflecting the probability of the object text under each classification theme; and
step (S3): calculating the similarity of two object texts from their intent classification vectors using the cosine method.
2. The text similarity calculation method according to claim 1, wherein
the formula of the cosine method is:
\cos\theta = \frac{\sum_{i=1}^{n}(A_i \times B_i)}{\sqrt{\sum_{i=1}^{n}(A_i)^2} \times \sqrt{\sum_{i=1}^{n}(B_i)^2}}
where cos θ denotes the similarity, i is the index of the classification theme of the intent classification vector and takes positive integer values from 1 to n, A denotes the first object text, B denotes the second object text, and A_i and B_i denote the probability of the first object text and of the second object text, respectively, under the current classification theme.
3. The text similarity calculation method according to claim 1, wherein
the intent recognition classification model is created by the bag-of-words method combined with the logistic regression algorithm.
4. The text similarity calculation method according to claim 1, wherein
the classification themes are service points of dialogs between customer service and users.
5. The text similarity calculation method according to claim 1, wherein
the historical text is text in historical consultation logs of dialogs between customer service and users.
6. The text similarity calculation method according to claim 1, wherein
the phrases are a subset of phrases filtered out of the historical text as needed.
7. The text similarity calculation method according to claim 1, wherein
the number of classification themes is the dimension of the intent classification vector.
8. The text similarity calculation method according to claim 1, wherein
the probabilities are the component values of the intent classification vector.
CN201610268995.9A 2016-04-27 2016-04-27 Text similarity computing method Pending CN107315731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610268995.9A CN107315731A (en) 2016-04-27 2016-04-27 Text similarity computing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610268995.9A CN107315731A (en) 2016-04-27 2016-04-27 Text similarity computing method

Publications (1)

Publication Number Publication Date
CN107315731A true CN107315731A (en) 2017-11-03

Family

ID=60184590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610268995.9A Pending CN107315731A (en) 2016-04-27 2016-04-27 Text similarity computing method

Country Status (1)

Country Link
CN (1) CN107315731A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1755687A (en) * 2004-09-30 2006-04-05 微软公司 Forming intent-based clusters and employing same by search engine
CN101621391A (en) * 2009-08-07 2010-01-06 北京百问百答网络技术有限公司 Method and system for classifying short texts based on probability topic
CN102681983A (en) * 2011-03-07 2012-09-19 北京百度网讯科技有限公司 Alignment method and device for text data
CN102662987A (en) * 2012-03-14 2012-09-12 华侨大学 Classification method of web text semantic based on Baidu Baike
CN102880723A (en) * 2012-10-22 2013-01-16 深圳市宜搜科技发展有限公司 Searching method and system for identifying user retrieval intention
CN103823844A (en) * 2014-01-26 2014-05-28 北京邮电大学 Question forwarding system and question forwarding method on the basis of subjective and objective context and in community question-and-answer service
CN104050256A (en) * 2014-06-13 2014-09-17 西安蒜泥电子科技有限责任公司 Initiative study-based questioning and answering method and questioning and answering system adopting initiative study-based questioning and answering method
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104516986A (en) * 2015-01-16 2015-04-15 青岛理工大学 Method and device for recognizing sentence
CN104731958A (en) * 2015-04-03 2015-06-24 北京航空航天大学 User-demand-oriented cloud manufacturing service recommendation method
CN104951433A (en) * 2015-06-24 2015-09-30 北京京东尚科信息技术有限公司 Method and system for intention recognition based on context
CN105653738A (en) * 2016-03-01 2016-06-08 北京百度网讯科技有限公司 Search result broadcasting method and device based on artificial intelligence

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111373391B (en) * 2017-11-29 2023-10-20 三菱电机株式会社 Language processing device, language processing system, and language processing method
CN111373391A (en) * 2017-11-29 2020-07-03 三菱电机株式会社 Language processing device, language processing system, and language processing method
CN110019715A (en) * 2017-12-08 2019-07-16 阿里巴巴集团控股有限公司 Response determines method, apparatus, equipment, medium and system
CN110019715B (en) * 2017-12-08 2023-07-14 阿里巴巴集团控股有限公司 Response determination method, device, equipment, medium and system
CN108334891A (en) * 2017-12-15 2018-07-27 北京奇艺世纪科技有限公司 A kind of Task intent classifier method and device
CN108388914B (en) * 2018-02-26 2022-04-01 中译语通科技股份有限公司 Classifier construction method based on semantic calculation and classifier
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 A kind of grader construction method, grader based on semantic computation
CN109344857A (en) * 2018-08-14 2019-02-15 重庆邂智科技有限公司 Text similarity measurement method and device, terminal and storage medium
CN109284486B (en) * 2018-08-14 2023-08-22 重庆邂智科技有限公司 Text similarity measurement method, device, terminal and storage medium
CN109344857B (en) * 2018-08-14 2022-05-13 重庆邂智科技有限公司 Text similarity measurement method and device, terminal and storage medium
CN109284486A (en) * 2018-08-14 2019-01-29 重庆邂智科技有限公司 Text similarity measure, device, terminal and storage medium
CN109635105A (en) * 2018-10-29 2019-04-16 厦门快商通信息技术有限公司 A kind of more intension recognizing methods of Chinese text and system
CN111428010A (en) * 2019-01-10 2020-07-17 北京京东尚科信息技术有限公司 Man-machine intelligent question and answer method and device
CN111428010B (en) * 2019-01-10 2024-01-12 北京汇钧科技有限公司 Man-machine intelligent question-answering method and device
CN112527985A (en) * 2020-12-04 2021-03-19 杭州远传新业科技有限公司 Unknown problem processing method, device, equipment and medium
CN115187153B (en) * 2022-09-14 2022-12-09 杭银消费金融股份有限公司 Data processing method and system applied to business risk tracing
CN115187153A (en) * 2022-09-14 2022-10-14 杭银消费金融股份有限公司 Data processing method and system applied to business risk tracing

Similar Documents

Publication Publication Date Title
CN107315731A (en) Text similarity computing method
CN109493166B (en) Construction method for task type dialogue system aiming at e-commerce shopping guide scene
CN104951433B (en) The method and system of intention assessment is carried out based on context
CN104111933B (en) Obtain business object label, set up the method and device of training pattern
CN104820629B (en) A kind of intelligent public sentiment accident emergent treatment system and method
CN103150333B (en) Opinion leader identification method in microblog media
CN109189904A (en) Individuation search method and system
CN110147445A (en) Intension recognizing method, device, equipment and storage medium based on text classification
CN109767318A (en) Loan product recommended method, device, equipment and storage medium
CN107977415A (en) Automatic question-answering method and device
CN103116588A (en) Method and system for personalized recommendation
CN106126751A (en) A kind of sorting technique with time availability and device
CN105844424A (en) Product quality problem discovery and risk assessment method based on network comments
CN102193936A (en) Data classification method and device
CN105022754A (en) Social network based object classification method and apparatus
Seret et al. A new SOM-based method for profile generation: Theory and an application in direct marketing
CN105787025A (en) Network platform public account classifying method and device
CN109766557A (en) A kind of sentiment analysis method, apparatus, storage medium and terminal device
CN104750674A (en) Man-machine conversation satisfaction degree prediction method and system
CN106844407A (en) Label network production method and system based on data set correlation
CN112215629B (en) Multi-target advertisement generating system and method based on construction countermeasure sample
CN104572915A (en) User event relevance calculation method based on content environment enhancement
CN110134866A (en) Information recommendation method and device
Catapang et al. A bilingual chatbot using support vector classifier on an automatic corpus engine dataset
CN114036289A (en) Intention identification method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171103