CN106980870A - Method for calculating text matching degree between short texts - Google Patents
Method for calculating text matching degree between short texts
- Publication number
- CN106980870A
- Application number
- CN201611256117.1A
- Authority
- CN
- China
- Prior art keywords
- text
- character
- matching
- sequence
- short
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to a method for calculating the text matching degree between short texts, comprising the following steps: performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts, respectively; determining the matching sequences of the first text and the second text, respectively; determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text; calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method; calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and calculating the text matching degree between the first and second texts based on the identical-character matching degree, the edit distance, and the respective string lengths of the first and second texts. The method not only matches texts with higher accuracy but is also robust, and has higher sensitivity and specificity.
Description
Technical field
The present invention relates to the technical field of text matching, and more specifically to a method for calculating the text matching degree between short texts.
Background art
The mainstream text similarity calculation methods currently in use include the following, each of which has shortcomings to some degree.
First, the common Jaro-Winkler method is suited to short strings and gives a bonus when the strings share a prefix, but it does not take into account the intervals between similar characters. Its negative-example rejection (rejecting short texts that receive a high similarity score but are not actually similar) is therefore poor.
Second, the longest common substring method preserves the relative order of characters and rejects negative examples fairly well, but it is sensitive to length and interval, so its positive-example matching (matching short texts that have high character similarity and really are similar) is poor.
Third, similarity methods based on edit distance are sensitive to character length and positional order. They reject negative examples that differ greatly fairly well, but perform poorly on positive examples that contain differences and on negative examples that look similar.
Fourth, the cosine similarity method is sensitive to length and interval and rejects some negative examples fairly well, but it does not consider positional order, so it rejects negative examples that differ in positional order poorly.
Fifth, the phrase similarity method considers the intervals between identical characters and handles some negative examples fairly well, but it does not consider positional order, so it rejects negative examples that differ in positional order poorly.
Summary of the invention
An object of the present invention is to provide a method for calculating the text matching degree between short texts that overcomes the above drawbacks to some extent.
To achieve this object, the present invention provides the following technical solution:
A method for calculating the text matching degree between short texts, comprising the following steps: a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts, respectively; b) determining, based on the segmentation sequences of the first and second texts, the matching sequence of the first text and the matching sequence of the second text, respectively, wherein the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character of the second text, arranged in the order in which they appear in the first text, and the matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character of the first text, arranged in the order in which they appear in the second text; c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text; d) calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method; e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
Preferably, the method further includes a text matching degree correction step: defining a text similarity threshold; determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
Preferably, in step b), each character of the first text is matched only against characters of the second text that have not yet been matched, and, when a match is made, the earliest such character in the second text is recorded in the matching sequence of the second text.
Another object of the present invention is to provide a short text matching method with higher matching accuracy.
To achieve this object, the present invention provides another technical solution as follows:
A short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched, the matching method comprising: using the above method to calculate the text matching degree between the text to be matched and each short text in the set; and determining the short text in the set that has the highest text matching degree with the text to be matched as the matching short text.
The method for calculating the text matching degree between short texts provided by the present invention calculates the matching degree between short texts more accurately; it not only matches texts with higher accuracy but is also robust, with higher sensitivity and specificity. The short text matching method provided by the present invention performs well in both positive-example matching and negative-example rejection, and therefore achieves higher matching accuracy than the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of the method for calculating the text matching degree between short texts provided by the first embodiment of the present invention.
Fig. 2 compares the technical indicators of the text similarity calculation method according to the present invention with those of prior-art text similarity calculation methods.
Detailed description of the embodiments
As shown in Fig. 1, the first embodiment of the present invention provides a method for calculating the text matching degree between short texts, which includes the following steps.
Step S10: perform word segmentation on the first text and the second text to obtain the segmentation sequences of the first and second texts, respectively.
As an example, the two texts to be matched, A (the first text) and B (the second text), are each segmented. Because the texts are short, full segmentation and word-frequency-based segmentation can be used. Segmentation yields the sequences A = "a_0 a_1 ... a_{m-1}" and B = "b_0 b_1 ... b_{n-1}", where m ≤ n; A denotes the shorter string and B the longer string.
Step S11: based on the segmentation sequences of the first and second texts, determine the matching sequence of the first text and the matching sequence of the second text, respectively.
Here, the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character of the second text, arranged in the order in which they appear in the first text; the matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character of the first text, arranged in the order in which they appear in the second text.
Continuing the example, if the i-th character a_i of text A has an identical character b_k in text B that matches it, the position of that character in text B is calculated as:
C(A, i, B) = { k | b_k = a_i, k = 0, 1, ..., n-1 }, where i = 0, 1, ..., m-1.
Preferably, each character of the first text is matched only against characters of the second text that have not yet been matched, and, when a match is made, the earliest such character in the second text is recorded in the matching sequence of the second text, thereby building up the complete matching sequence of the second text.
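The preferred matching rule of step S11 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `matching_sequences` and its return layout are assumptions, and texts are treated as plain sequences of tokens (a Python string works as a sequence of characters).

```python
def matching_sequences(a, b):
    """Greedy one-to-one matching of tokens of `a` against tokens of `b`.

    Returns (ms_a, ms_b, positions), where
      ms_a      -- matched tokens in the order they appear in `a`
      ms_b      -- the same matched tokens in the order they appear in `b`
      positions -- positions[k] is the index in `b` matched by ms_a[k],
                   i.e. the value written C(A, i, B) in the text
    (Names and return layout are illustrative assumptions.)
    """
    used = [False] * len(b)   # tokens of b that are already matched
    ms_a, positions = [], []
    for tok in a:
        # Only unmatched tokens of b are candidates; take the earliest one.
        for k, other in enumerate(b):
            if not used[k] and other == tok:
                used[k] = True
                ms_a.append(tok)
                positions.append(k)
                break
    ms_b = [b[k] for k in sorted(positions)]
    return ms_a, ms_b, positions


ms_a, ms_b, positions = matching_sequences("abcd", "xbadc")
print(ms_a, ms_b, positions)
```

Because every token of the second text can be matched at most once, a repeated character in the first text only matches as many times as it occurs in the second text.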
Step S12: traverse the matching sequence of the first text and determine the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text.
Step S13: calculate the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method.
Continuing the example, if the i-th and (i+1)-th characters of text A each match a corresponding character in text B, the position interval between the characters in text B that correspond to the successfully matched i-th and (i+1)-th characters is calculated as:
Δ(A, i+1, i, B) = C(A, i+1, B) - C(A, i, B)
The identical-character matching degree between the first and second texts can then be calculated with the following formula:
[formula not reproduced in this text]
where N denotes the total number of characters successfully matched between text A and text B, and Δ(A, i+1, i, B) denotes the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.
Those skilled in the art will understand that taking the position intervals between characters into account improves negative-example rejection.
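The position intervals of step S12 are simple differences of consecutive matched positions. The sketch below computes only the intervals Δ; it does not attempt to reproduce the matching degree formula of step S13. The function name `position_intervals` and the input list of matched positions are illustrative assumptions.

```python
def position_intervals(positions):
    """Return the intervals C(A, i+1, B) - C(A, i, B) for consecutive matches.

    `positions[k]` is assumed to hold the index in the second text of the
    character matched by the k-th character of the first text's matching
    sequence (hypothetical input layout).
    """
    return [positions[i + 1] - positions[i] for i in range(len(positions) - 1)]


# Matched characters of A land at indices 2, 1, 4, 3 of B.
print(position_intervals([2, 1, 4, 3]))
```

A negative interval means two characters that are consecutive in the first text's matching sequence appear in the opposite order in the second text; a large absolute interval means they are far apart there.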
Step S14: calculate the edit distance between the matching sequence of the first text and the matching sequence of the second text.
Specifically, all characters successfully matched between text A and text B are arranged in the order in which they appear in A to form the string ms_1, and in the order in which they appear in B to form the string ms_2.
The edit distance (Levenshtein distance) between ms_1 and ms_2 can then be expressed as:
t = d(ms_1, ms_2), where d denotes the edit distance between two strings.
Those skilled in the art will understand that using the edit distance as a factor in calculating the text matching degree preserves sensitivity to character length and positional order, so that negative examples that differ greatly are rejected fairly well.
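For reference, the Levenshtein distance d used in step S14 can be computed with the standard dynamic-programming recurrence. The sketch below is a textbook two-row version; the function name `levenshtein` is an illustrative choice, and the inputs may be strings or lists of tokens.

```python
def levenshtein(s, t):
    """Levenshtein (edit) distance between two sequences.

    Counts the minimum number of single-element insertions, deletions,
    and substitutions needed to turn `s` into `t`.
    """
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        cur = [i]                           # turning s[:i] into the empty prefix
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            cur.append(min(prev[j] + 1,         # delete cs
                           cur[j - 1] + 1,      # insert ct
                           prev[j - 1] + cost)) # substitute or keep
        prev = cur
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Since both matching sequences contain the same characters, only in different orders, a nonzero distance here reflects differences in positional order between the two texts.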
Step S15: calculate the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
As an example, the text matching degree between the first and second texts is calculated with the following formula:
[formula not reproduced in this text]
where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and |S_A| and |S_B| are the string lengths of the first and second texts, respectively.
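The formula of step S15 is not available in this text. Purely to illustrate how a score can be built from the four quantities m, t, |S_A| and |S_B|, the sketch below borrows the form of the classic Jaro similarity, with the edit distance t standing in for Jaro's transposition term. Whether the patent's formula has this form is an assumption, and the function name `jaro_style_degree` is hypothetical.

```python
def jaro_style_degree(m, t, len_a, len_b):
    """Combine m, t and the two string lengths in the style of Jaro similarity.

    ASSUMPTION: this is the textbook Jaro form
        (m/|S_A| + m/|S_B| + (m - t)/m) / 3
    used only as an illustration; it is not taken from the patent text.
    """
    if m <= 0 or len_a == 0 or len_b == 0:
        return 0.0   # nothing matched, or an empty text
    return (m / len_a + m / len_b + (m - t) / m) / 3.0


# Two 4-character texts, m = 2, edit distance t = 1.
print(jaro_style_degree(2, 1, 4, 4))
```

In this form the first two terms reward coverage of each text by matched characters, and the third term penalizes order differences between the matching sequences.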
According to an improved implementation of the first embodiment, the method further includes a text matching degree correction step:
defining a text similarity threshold;
determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and
correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
Those skilled in the art will understand that when the similarity between text A and text B is relatively high and the identical portion at the start is relatively long, the probability that text A and text B are similar is higher. The correction step takes this factor into account.
Specifically, the text matching degree can be corrected with the following formula:
[formula not reproduced in this text]
where d_AB is the text matching degree between the first and second texts, l is the character length of the identical initial portion determined in the correction step, and p is a correction factor, usually taking a value from 0 to 0.25 and preferably 0.1.
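The correction formula is likewise not available in this text. The quantities it uses (a similarity threshold, a common-prefix length l, and a factor p of at most 0.25) are the same ones used by the well-known Winkler prefix adjustment, so the sketch below shows that adjustment as one plausible reading. The form of the formula, the default threshold of 0.7, the cap of 4 on the prefix length, and both function names are assumptions borrowed from the standard Jaro-Winkler definition, not from the patent.

```python
def common_prefix_length(ms_a, ms_b):
    """Length of the identical initial portion of two matching sequences."""
    n = 0
    for x, y in zip(ms_a, ms_b):
        if x != y:
            break
        n += 1
    return n


def prefix_corrected(d_ab, l, p=0.1, threshold=0.7, max_prefix=4):
    """Winkler-style prefix boost (ASSUMED form, not from the patent text).

    Scores below `threshold` are returned unchanged; otherwise the score is
    moved toward 1 in proportion to the common-prefix length `l`.
    `threshold=0.7` and `max_prefix=4` are conventional Jaro-Winkler defaults.
    """
    if d_ab < threshold:
        return d_ab
    l = min(l, max_prefix)   # keeps l * p <= 1 when p <= 0.25
    return d_ab + l * p * (1.0 - d_ab)


l = common_prefix_length("abcx", "abdy")
print(l, prefix_corrected(0.8, l))
```

With p limited to 0.25 and the prefix length capped at 4, the corrected score in this sketch never exceeds 1.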
The second embodiment of the present invention provides a short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched. The matching method includes:
First, using the method for calculating the text matching degree between short texts provided by the first embodiment, calculating the text matching degree between the text to be matched and each short text in the set.
Second, determining the short text in the set that has the highest text matching degree with the text to be matched as the short text that matches the text to be matched.
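The selection step of the second embodiment amounts to scoring every candidate and keeping the highest-scoring one (or ones, in case of ties). The sketch below takes the scoring function as a parameter; `overlap_score` is only a toy stand-in so the example runs on its own and is not the matching degree described in this document. All names are illustrative assumptions.

```python
def overlap_score(a, b):
    """Toy stand-in scorer: shared distinct characters over all distinct characters."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def best_matches(query, candidates, score):
    """Return every candidate whose score against `query` is the maximum."""
    scored = [(score(query, c), c) for c in candidates]
    if not scored:
        return []
    top = max(s for s, _ in scored)
    return [c for s, c in scored if s == top]


print(best_matches("abc", ["abd", "xyz", "abc"], overlap_score))
```

Returning a list rather than a single item covers the "one or more short texts" wording: several candidates are returned when they tie for the highest score.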
Fig. 2 compares the technical indicators of the text similarity calculation method according to the present invention (and the short text matching method based on it) with those of five prior-art text similarity calculation methods (and the short text matching methods based on them).
Specifically, to characterize the accuracy of the six calculation methods, their ROC curves are plotted, with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis.
As can be seen from Fig. 2, the AUC of all six calculation methods is above 0.9, indicating that each of them achieves fairly good text matching and/or text classification. In terms of AUC, the calculation method provided by the present invention performs best, the Jaro-Winkler method comes second, and the edit distance method performs worst; the LCS (longest common substring), cosine similarity, and phrase similarity methods fall in between. In terms of the shape of the curves, the curve of the calculation method provided by the present invention also lies closest to the upper-left corner of the axes, indicating that the method has higher sensitivity and specificity in that region.
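As a reminder of how such a comparison is produced, the sketch below builds ROC points (FPR, TPR) from similarity scores and ground-truth labels by sweeping the decision threshold, then estimates the AUC with the trapezoidal rule. The function names and the tiny data set are illustrative assumptions and are unrelated to the data behind Fig. 2.

```python
def roc_points(scores, labels):
    """ROC points (fpr, tpr) obtained by lowering the threshold step by step.

    `labels` use 1 for a true match (positive) and 0 for a non-match.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        thr = pairs[i][0]
        # Accept every example whose score equals the current threshold.
        while i < len(pairs) and pairs[i][0] == thr:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(pts, auc(pts))
```

A curve that hugs the upper-left corner reaches a high TPR at a low FPR, which is why it corresponds to a larger AUC.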
With the text similarity calculation method provided by the present invention, precision and recall are better at a given F1 threshold; when recall reaches 60%, short text matching accuracy is better than that of the various prior-art methods and can reach 87.1%; in addition, in terms of the AUC standard deviation, the text similarity calculation method provided by the present invention is better overall than the other methods.
The above describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Those skilled in the art may make various modifications without departing from the idea of the present invention and the appended claims.
Claims (7)
1. A method for calculating the text matching degree between short texts, comprising the following steps:
a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts, respectively;
b) determining, based on the segmentation sequences of the first and second texts, the matching sequence of the first text and the matching sequence of the second text, respectively; wherein the matching sequence of the first text is the sequence formed by those characters of the first text that are identical to some character of the second text, arranged in the order in which they appear in the first text, and the matching sequence of the second text is the sequence formed by those characters of the second text that are identical to some character of the first text, arranged in the order in which they appear in the second text;
c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text;
d) calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method;
e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and
f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
2. The method according to claim 1, characterized in that it further comprises a text matching degree correction step:
defining a text similarity threshold;
determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and
correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
3. The method according to claim 1, characterized in that, in step b):
each character of the first text is matched only against characters of the second text that have not yet been matched, and, when a match is made, the earliest such character in the second text is recorded in the matching sequence of the second text.
4. The method according to claim 1, characterized in that, in step d),
the identical-character matching degree between the first and second texts is calculated with the following formula:
[formula not reproduced in this text]
where N denotes the total number of characters successfully matched between the first text and the second text, and Δ(A, i+1, i, B) denotes the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.
5. The method according to claim 4, characterized in that, in step f),
the text matching degree between the first and second texts is calculated with the following formula:
[formula not reproduced in this text]
where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and |S_A| and |S_B| are the string lengths of the first and second texts, respectively.
6. The method according to claim 2, characterized in that the text matching degree is corrected with the following formula:
[formula not reproduced in this text]
where d_AB is the text matching degree between the first and second texts, l is the character length of the identical initial portion, and p is a correction factor taking a value from 0 to 0.25.
7. A short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched, the matching method comprising:
calculating, using the method according to any one of claims 1 to 6, the text matching degree between the text to be matched and each short text in the set of short texts; and
determining the short text in the set of short texts that has the highest text matching degree with the text to be matched as the matching short text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611256117.1A CN106980870B (en) | 2016-12-30 | 2016-12-30 | Method for calculating text matching degree between short texts |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106980870A (en) | 2017-07-25 |
CN106980870B (en) | 2020-07-28 |
Family
ID=59340951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611256117.1A Active CN106980870B (en) | 2016-12-30 | 2016-12-30 | Method for calculating text matching degree between short texts |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106980870B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109830229A (en) * | 2018-12-11 | 2019-05-31 | 平安科技(深圳)有限公司 | Audio corpus intelligence cleaning method, device, storage medium and computer equipment |
CN110147429A (en) * | 2019-04-15 | 2019-08-20 | 平安科技(深圳)有限公司 | Text comparative approach, device, computer equipment and storage medium |
CN111144104A (en) * | 2018-11-02 | 2020-05-12 | 中国电信股份有限公司 | Text similarity determination method and device and computer readable storage medium |
CN111368061A (en) * | 2018-12-25 | 2020-07-03 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN112926297A (en) * | 2021-02-26 | 2021-06-08 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing information |
CN113094559A (en) * | 2021-04-25 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Information matching method and device, electronic equipment and storage medium |
CN115392939A (en) * | 2022-10-28 | 2022-11-25 | 中国环境科学研究院 | Hazardous waste tracing method based on retrieval contrast and matching degree calculation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104252445A (en) * | 2013-06-26 | 2014-12-31 | 华为技术有限公司 | Document similarity calculation method and near-duplicate document detection method and device |
US20160283583A1 (en) * | 2014-03-14 | 2016-09-29 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for text information processing |
CN106021223A (en) * | 2016-05-09 | 2016-10-12 | Tcl集团股份有限公司 | Sentence similarity calculation method and system |
CN106033416A (en) * | 2015-03-09 | 2016-10-19 | 阿里巴巴集团控股有限公司 | A string processing method and device |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104252445A (en) * | 2013-06-26 | 2014-12-31 | 华为技术有限公司 | Document similarity calculation method and near-duplicate document detection method and device |
US20160283583A1 (en) * | 2014-03-14 | 2016-09-29 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for text information processing |
CN106033416A (en) * | 2015-03-09 | 2016-10-19 | 阿里巴巴集团控股有限公司 | A string processing method and device |
CN106021223A (en) * | 2016-05-09 | 2016-10-12 | Tcl集团股份有限公司 | Sentence similarity calculation method and system |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111144104A (en) * | 2018-11-02 | 2020-05-12 | 中国电信股份有限公司 | Text similarity determination method and device and computer readable storage medium |
CN109830229A (en) * | 2018-12-11 | 2019-05-31 | 平安科技(深圳)有限公司 | Audio corpus intelligence cleaning method, device, storage medium and computer equipment |
CN111368061A (en) * | 2018-12-25 | 2020-07-03 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN111368061B (en) * | 2018-12-25 | 2024-04-12 | 深圳市优必选科技有限公司 | Short text filtering method, device, medium and computer equipment |
CN110147429A (en) * | 2019-04-15 | 2019-08-20 | 平安科技(深圳)有限公司 | Text comparative approach, device, computer equipment and storage medium |
CN110147429B (en) * | 2019-04-15 | 2023-08-15 | 平安科技(深圳)有限公司 | Text comparison method, apparatus, computer device and storage medium |
CN112926297A (en) * | 2021-02-26 | 2021-06-08 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing information |
CN112926297B (en) * | 2021-02-26 | 2023-06-30 | 北京百度网讯科技有限公司 | Method, apparatus, device and storage medium for processing information |
CN113094559A (en) * | 2021-04-25 | 2021-07-09 | 百度在线网络技术(北京)有限公司 | Information matching method and device, electronic equipment and storage medium |
CN113094559B (en) * | 2021-04-25 | 2024-05-31 | 百度在线网络技术(北京)有限公司 | Information matching method, device, electronic equipment and storage medium |
CN115392939A (en) * | 2022-10-28 | 2022-11-25 | 中国环境科学研究院 | Hazardous waste tracing method based on retrieval contrast and matching degree calculation |
Also Published As
Publication number | Publication date |
---|---|
CN106980870B (en) | 2020-07-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106980870A (en) | Text matches degree computational methods between short text | |
TWI664540B (en) | Search word error correction method and device, and weighted edit distance calculation method and device | |
WO2019184217A1 (en) | Hotspot event classification method and apparatus, and storage medium | |
CN103336766B (en) | Short text garbage identification and modeling method and device | |
CN112488133B (en) | Video/picture-text cross-modal retrieval method | |
CN109147767A (en) | Digit recognition method, device, computer equipment and storage medium in voice | |
CN105068997B (en) | The construction method and device of parallel corpora | |
CN110674396B (en) | Text information processing method and device, electronic equipment and readable storage medium | |
CN113761880B (en) | Data processing method for text verification, electronic equipment and storage medium | |
CN106708798B (en) | Character string segmentation method and device | |
CN103927532B (en) | Person's handwriting method for registering based on stroke feature | |
WO2020114100A1 (en) | Information processing method and apparatus, and computer storage medium | |
WO2014022172A2 (en) | Information classification based on product recognition | |
CN106980620A (en) | A kind of method and device matched to Chinese character string | |
WO2019201024A1 (en) | Method, apparatus and device for updating model parameter, and storage medium | |
CN107229939B (en) | Similar document judgment method and device | |
CN106610990A (en) | Emotional tendency analysis method and apparatus | |
CN108132917B (en) | Document error correction marking method | |
CN111340020A (en) | Formula identification method, device, equipment and storage medium | |
CN110705261B (en) | Chinese text word segmentation method and system thereof | |
CN113033204A (en) | Information entity extraction method and device, electronic equipment and storage medium | |
CN106888201A (en) | A kind of method of calibration and device | |
CN109388696A (en) | Delete method, apparatus, storage medium and the electronic equipment of rumour article | |
CN111950267B (en) | Text triplet extraction method and device, electronic equipment and storage medium | |
EP3703061A1 (en) | Image retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||