CN106980870A

CN106980870A - Method for calculating the text matching degree between short texts

Info

Publication number: CN106980870A
Application number: CN201611256117.1A
Authority: CN
Inventors: 王宇; 华锦芝; 郑建宾; 张琦; 冯亮
Original assignee: China Unionpay Co Ltd
Current assignee: China Unionpay Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2017-07-25
Anticipated expiration: 2036-12-30
Also published as: CN106980870B

Abstract

The present invention relates to a method for calculating the text matching degree between short texts, comprising the following steps: performing word segmentation on a first text and a second text to obtain their respective segmentation sequences; determining the matching sequences of the first text and the second text respectively; determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text; calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method; calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and calculating the text matching degree between the first and second texts based on the identical-character matching degree, the edit distance, and the respective string lengths of the first and second texts. The method not only matches texts with higher accuracy but is also robust, with higher sensitivity and specificity.

Description

Method for calculating the text matching degree between short texts

Technical field

The present invention relates to the field of text matching technology, and more specifically to a method for calculating the text matching degree between short texts.

Background art

The current mainstream text similarity calculation methods include the following, each of which has shortcomings to some degree.

1. The common Jaro-Winkler method suits short strings and gives a bonus for identical prefixes, but it does not consider the intervals between similar characters. Its negative-example rejection (rejecting short texts whose computed similarity is high but which are not actually similar) is therefore poor.

2. The longest common substring method keeps characters in relative order, so its negative-example rejection is good, but it is sensitive to length and interval, so its positive-example matching (matching short texts whose character similarity is high and which are actually similar) is poor.

3. Similarity calculation based on edit distance is sensitive to character length and positional order. It rejects negative examples that differ greatly well, but performs poorly on positive examples that contain differences and on negative examples that are similar.

4. The cosine similarity method is sensitive to length and interval and rejects some negative examples well, but it does not consider positional order, so it is poor at rejecting negative examples that differ in positional order.

5. The phrase similarity method considers the intervals between identical characters and handles some negative examples well, but it does not consider positional order, so it is poor at rejecting negative examples that differ in positional order.

Summary of the invention

An object of the present invention is to provide a method for calculating the text matching degree between short texts that overcomes the above shortcomings to some extent.

To achieve this object, the present invention provides the following technical solution:

A method for calculating the text matching degree between short texts, comprising the following steps:

a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts respectively;

b) determining the matching sequences of the first text and the second text respectively, based on the segmentation sequences of the first and second texts; wherein the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character in the second text, arranged in the order in which they appear in the first text, and the matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character in the first text, arranged in the order in which they appear in the second text;

c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text;

d) calculating the identical-character matching degree between the first and second texts based on each position interval, using a phrase similarity calculation method;

e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and

f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.

Preferably, the method further includes a text matching degree correction step: defining a text similarity threshold; determining the character length of the initial portion of the matching sequence of the first text that is identical to the matching sequence of the second text; and correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.

Preferably, in step b), each character in the first text is matched only against characters in the second text that have not yet been matched, and the matched character that comes first in order in the second text is recorded in the matching sequence of the second text.

Another object of the present invention is to provide a short-text matching method with higher matching accuracy.

To achieve this object, the present invention provides another technical solution as follows:

A short-text matching method for finding, from a set of short texts, one or more short texts that match a text to be matched, the matching method comprising: using the above method to calculate the text matching degree between the text to be matched and each short text in the set; and determining the short text in the set that has the highest text matching degree with the text to be matched as the matching short text.

The method for calculating the text matching degree between short texts provided by the present invention calculates the matching degree between short texts more accurately. It not only matches texts with higher accuracy but is also robust, with higher sensitivity and specificity. The short-text matching method provided by the present invention performs well in both positive-example matching and negative-example rejection, and therefore achieves higher matching accuracy than the prior art.

Brief description of the drawings

Fig. 1 shows a flow chart of the method for calculating the text matching degree between short texts provided by the first embodiment of the present invention.

Fig. 2 shows a comparison of technical indicators between the text similarity calculation method according to the present invention and text similarity calculation methods of the prior art.

Detailed description

As shown in Fig. 1, the first embodiment of the present invention provides a method for calculating the text matching degree between short texts, which includes the following steps.

Step S10: perform word segmentation on the first text and the second text to obtain the segmentation sequences of the first and second texts respectively.

As an example, word segmentation is performed separately on the two texts to be matched, A (the first text) and B (the second text). Because the texts are short, a segmentation method combining full segmentation with word-frequency statistics can be used. Segmentation yields the sequences $A = a_0 a_1 \ldots a_{m-1}$ and $B = b_0 b_1 \ldots b_{n-1}$, where $m \le n$; A denotes the shorter string and B the longer string.

Step S11: determine the matching sequences of the first text and the second text respectively, based on the segmentation sequences of the first and second texts.

Here, the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character in the second text, arranged in the order in which they appear in the first text. The matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character in the first text, arranged in the order in which they appear in the second text.

Continuing the example, if the i-th character $a_i$ of text A has an identical character $b_k$ in text B that matches it, the position of that character in text B is computed as:

$$C(A, i, B) = \{\, k \mid b_k = a_i,\ k = 0, 1, \ldots, n-1 \,\}, \quad i = 0, 1, \ldots, m-1.$$

Preferably, each character in the first text is matched only against characters in the second text that have not yet been matched, and the matched character that comes first in order in the second text is recorded in the matching sequence of the second text, thereby forming the complete matching sequence of the second text.
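The preferred matching rule of step S11 can be sketched as follows. This is a minimal illustration, assuming that each token of A greedily takes the first not-yet-matched identical token of B; the function names `matching_positions` and `matching_sequences` are hypothetical and do not come from the patent.

```python
def matching_positions(a, b):
    """For each token a[i], return the index k of the first not-yet-matched
    token in b with b[k] == a[i], or None if no such token remains.
    Assumption: greedy left-to-right matching, as in the preferred case."""
    used = [False] * len(b)
    positions = []
    for token in a:
        hit = None
        for k, other in enumerate(b):
            if not used[k] and other == token:
                used[k] = True  # each token of b can be matched only once
                hit = k
                break
        positions.append(hit)
    return positions


def matching_sequences(a, b):
    """Return (ms1, ms2): the matched tokens in A order and in B order."""
    positions = matching_positions(a, b)
    ms1 = [a[i] for i, k in enumerate(positions) if k is not None]
    ms2 = [b[k] for k in sorted(k for k in positions if k is not None)]
    return ms1, ms2


# Toy input: single characters stand in for segmented tokens.
print(matching_positions("abca", "cabxa"))   # [1, 2, 0, 4]
print(matching_sequences("abca", "cabxa"))
```

Because a token of B is marked as used once it is matched, repeated characters in A are paired with distinct occurrences in B, and the two matching sequences always have the same length.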

Step S12: traverse the matching sequence of the first text and determine the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text.

Step S13: calculate the identical-character matching degree between the first and second texts based on each position interval, using a phrase similarity calculation method.

Continuing the example, if the i-th and (i+1)-th characters of text A each match a corresponding character in text B, the position interval in text B between the characters corresponding to the successfully matched i-th and (i+1)-th characters is computed as:

$$\Delta(A, i+1, i, B) = C(A, i+1, B) - C(A, i, B)$$
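The interval computation can be sketched in a few lines. The sketch assumes that the matched positions C(A, i, B) are already available as a list, with `None` marking characters of A that found no match; the helper name `position_intervals` and this input convention are illustrative assumptions.

```python
def position_intervals(positions):
    """Given the B-positions of the characters of A (None = unmatched),
    return the differences between the B-positions of consecutive
    matched characters, i.e. C(A, i+1, B) - C(A, i, B)."""
    matched = [k for k in positions if k is not None]
    return [nxt - cur for cur, nxt in zip(matched, matched[1:])]


# Positions of "abca" inside "cabxa" under first-unmatched matching.
print(position_intervals([1, 2, 0, 4]))  # [1, -2, 4]
```

An interval is negative when two consecutive matched characters appear in the opposite order in B. How the matching-degree formula treats the sign of an interval cannot be recovered from the text, so the sketch returns the signed differences unchanged.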

The identical-character matching degree between the first and second texts can then be calculated using the following formula:

where N is the total number of characters successfully matched between text A and text B, and $\Delta(A, i+1, i, B)$ is the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.

Those skilled in the art will understand that taking the position intervals between characters into account improves negative-example rejection.

Step S14: calculate the edit distance between the matching sequence of the first text and the matching sequence of the second text.

Specifically, the set of all characters successfully matched between text A and text B is arranged in the order in which the characters appear in A to form the string $ms_1$, and in the order in which they appear in B to form the string $ms_2$.

The edit distance (Levenshtein distance) between the strings $ms_1$ and $ms_2$ can then be expressed as:

$$t = d(ms_1, ms_2),$$

where d denotes the edit distance between two strings.
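For reference, the Levenshtein distance d can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch using unit costs for insertion, deletion and substitution; the function name `edit_distance` is illustrative, and the source does not state which implementation or cost model was actually used.

```python
def edit_distance(s, t):
    """Levenshtein distance between sequences s and t (unit costs),
    computed row by row so that only two rows are kept in memory."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, x in enumerate(s, start=1):
        cur = [i]
        for j, y in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x from s
                cur[j - 1] + 1,          # insert y into s
                prev[j - 1] + (x != y),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


# Matching sequences of "abca" and "cabxa": same characters, different order.
print(edit_distance("abca", "caba"))  # 2
```

Since both matching sequences contain the same characters, a nonzero distance here reflects only differences in positional order.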

Those skilled in the art will understand that using the edit distance as a factor in the text matching degree preserves sensitivity to character length and positional order, so that negative examples that differ greatly are rejected well.

Step S15: calculate the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.

As an example, the text matching degree between the first and second texts is calculated as follows:

where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and $|S_A|$ and $|S_B|$ are the string lengths of the first and second texts respectively.

According to an improved variant of the first embodiment, the method further includes a text matching degree correction step:

defining a text similarity threshold;

determining the character length of the initial portion of the matching sequence of the first text that is identical to the matching sequence of the second text; and

correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.

Those skilled in the art will understand that when the similarity between text A and text B is relatively high and the identical portion at the start is relatively long, the probability that text A and text B are similar is higher. The correction step takes this factor into account.
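The length of the identical initial portion can be obtained by comparing the two matching sequences element by element. The sketch below covers only this input to the correction step and does not attempt the correction formula itself; the function name `common_prefix_length` is an illustrative assumption.

```python
def common_prefix_length(ms1, ms2):
    """Number of leading elements shared by the two matching sequences."""
    length = 0
    for x, y in zip(ms1, ms2):
        if x != y:
            break  # the first mismatch ends the identical initial portion
        length += 1
    return length


print(common_prefix_length("abcd", "abxd"))  # 2
```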

Specifically, the text matching degree can be corrected using the following formula:

where $d_{AB}$ is the text matching degree between the first and second texts, l is the character length of the identical initial portion determined in the correction step, and p is a correction factor, typically between 0 and 0.25 and preferably 0.1.

The second embodiment of the present invention provides a short-text matching method for finding, from a set of short texts, one or more short texts that match a text to be matched. The matching method includes:

1. Using the method for calculating the text matching degree between short texts provided by the first embodiment, calculate the text matching degree between the text to be matched and each short text in the set.

2. Determine the short text in the set that has the highest text matching degree with the text to be matched as the short text that matches the text to be matched.
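The selection in the second embodiment amounts to scoring every candidate and keeping the best. The sketch below takes the scoring function as a parameter. Because the matching-degree formula is not reproduced in this text, the example call uses `difflib.SequenceMatcher` from the Python standard library purely as a stand-in score; it is not the measure defined by the invention. The names `best_matches` and `placeholder_score` are hypothetical.

```python
from difflib import SequenceMatcher


def placeholder_score(a, b):
    """Stand-in similarity in [0, 1]. NOT the patent's matching degree."""
    return SequenceMatcher(None, a, b).ratio()


def best_matches(query, candidates, score):
    """Return every candidate whose score against the query is maximal,
    so ties yield more than one matching short text."""
    scored = [(score(query, c), c) for c in candidates]
    if not scored:
        return []
    top = max(s for s, _ in scored)
    return [c for s, c in scored if s == top]


print(best_matches("abc", ["abd", "xyz", "abc"], placeholder_score))  # ['abc']
```

Returning all candidates that tie for the highest score is one way to read "one or more short texts"; an implementation could equally return only the first.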

Fig. 2 shows a comparison of technical indicators between the text similarity calculation method according to the present invention (and the short-text matching method based on it) and five prior-art text similarity calculation methods (and the short-text matching methods based on them).

Specifically, to characterize the accuracy of the six calculation methods, their ROC curves are plotted, with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis.
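As a reminder of how such a curve is obtained, the following sketch sweeps a decision threshold over similarity scores and integrates the resulting ROC points with the trapezoidal rule. The scores and labels are toy values made up for illustration and are unrelated to the experiment behind Fig. 2; the function names are assumptions.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), one per distinct threshold.
    labels: 1 = true match, 0 = non-match. Both classes must be present."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted a match.
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


toy_scores = [0.9, 0.8, 0.7, 0.6]   # illustrative values only
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))  # 0.75
```

A curve that hugs the upper-left corner has an AUC close to 1, which is the property the comparison of the six methods relies on.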

As can be seen from Fig. 2, the AUC of all six calculation methods exceeds 0.9, indicating that all six achieve relatively good text matching and/or text classification. In terms of AUC, the calculation method provided by the present invention performs best, the Jaro-Winkler method comes second, and the edit distance method performs worst, while the LCS (longest common substring), cosine similarity and phrase similarity methods fall in between. In terms of curve shape, the curve of the method provided by the present invention also lies closest to the upper-left corner of the axes, indicating that the method has higher sensitivity and specificity at those positions.

With the text similarity calculation method provided by the present invention, precision and recall are better at a given F1 threshold. When recall reaches 60%, its short-text matching accuracy is better than that of the various prior-art methods and can reach 87.1%. In addition, in terms of the standard deviation of the AUC, the text similarity calculation method provided by the present invention is better overall than the other methods.

The above describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Those skilled in the art may make various modifications without departing from the spirit of the invention and the appended claims.

Claims

1. A method for calculating the text matching degree between short texts, comprising the following steps:

a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts respectively;

b) determining the matching sequences of the first text and the second text respectively, based on the segmentation sequences of the first and second texts; wherein the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character in the second text, arranged in the order in which they appear in the first text, and the matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character in the first text, arranged in the order in which they appear in the second text;

c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text;

d) calculating the identical-character matching degree between the first and second texts based on each position interval, using a phrase similarity calculation method;

e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and

f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.

2. The method according to claim 1, characterized in that it further comprises a text matching degree correction step:

defining a text similarity threshold;

determining the character length of the initial portion of the matching sequence of the first text that is identical to the matching sequence of the second text; and

correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.

3. The method according to claim 1, characterized in that, in step b):

each character in the first text is matched only against characters in the second text that have not yet been matched, and the matched character that comes first in order in the second text is recorded in the matching sequence of the second text.

4. The method according to claim 1, characterized in that, in step d),

the identical-character matching degree between the first and second texts is calculated by the following formula:

where N is the total number of characters successfully matched between the first text and the second text, and $\Delta(A, i+1, i, B)$ is the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.

5. The method according to claim 4, characterized in that, in step f),

the text matching degree between the first and second texts is calculated by the following formula:

where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and $|S_A|$ and $|S_B|$ are the string lengths of the first and second texts respectively.

6. The method according to claim 2, characterized in that the correction formula for the text matching degree is:

where $d_{AB}$ is the text matching degree between the first and second texts, l is the character length of the identical initial portion, and p is a correction factor with a value between 0 and 0.25.

7. A short-text matching method for finding, from a set of short texts, one or more short texts that match a text to be matched, the matching method comprising:

calculating, using the method according to any one of claims 1 to 6, the text matching degree between the text to be matched and each short text in the set of short texts; and

determining the short text in the set of short texts that has the highest text matching degree with the text to be matched as the matching short text.