CN106980870A - Method for calculating text matching degree between short texts - Google Patents

Method for calculating text matching degree between short texts

Info

Publication number
CN106980870A
Authority
CN
China
Prior art keywords
text
character
matching
sequence
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611256117.1A
Other languages
Chinese (zh)
Other versions
CN106980870B (en)
Inventor
王宇
华锦芝
郑建宾
张琦
冯亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN201611256117.1A
Publication of CN106980870A
Application granted
Publication of CN106980870B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method for calculating the text matching degree between short texts, comprising the following steps: performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts respectively; determining the matching sequence of the first text and the matching sequence of the second text; determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text; calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method; calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and calculating the text matching degree between the first and second texts based on the identical-character matching degree, the edit distance, and the respective string lengths of the first and second texts. The method achieves high matching accuracy, is robust, and has high sensitivity and specificity.

Description

Method for calculating text matching degree between short texts
Technical field
The present invention relates to the field of text matching, and more specifically to a method for calculating the text matching degree between short texts.
Background
The mainstream text similarity calculation methods currently in use include the following, each of which has shortcomings to some degree.
1. The common Jaro-Winkler method suits short strings and gives a bonus for identical prefixes, but it does not consider the intervals between matching characters. It therefore performs poorly at rejecting negative examples (short texts that receive a high similarity score but are not actually similar).
2. The longest common substring method preserves the relative order of characters and rejects negative examples well, but it is sensitive to length and interval, so it performs poorly at matching positive examples (short texts whose characters are highly similar and that really are similar).
3. Similarity based on edit distance is sensitive to string length and character order. It rejects negative examples that differ greatly, but it performs poorly both at matching positive examples that contain differences and at rejecting negative examples that look similar.
4. Cosine similarity is sensitive to length and interval and rejects some negative examples well, but it does not consider character order, so it performs poorly at rejecting negative examples that differ in order.
5. Phrase similarity considers the intervals between identical characters and handles some negative examples well, but it does not consider character order, so it performs poorly at rejecting negative examples that differ in order.
Summary of the invention
An object of the present invention is to provide a method for calculating the text matching degree between short texts that overcomes the above shortcomings to some extent.
To achieve this object, the present invention provides the following technical solution:
A method for calculating the text matching degree between short texts, comprising the following steps: a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts respectively; b) determining, based on the segmentation sequences of the first and second texts, the matching sequence of the first text and the matching sequence of the second text respectively, wherein the matching sequence of the first text is the sequence formed by the characters in the first text that are identical to some character in the second text, arranged in the order in which those characters appear in the first text, and the matching sequence of the second text is the sequence formed by the characters in the second text that are identical to some character in the first text, arranged in the order in which those characters appear in the second text; c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text; d) calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method; e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
Preferably, the method further comprises a text matching degree correction step: defining a text similarity threshold; determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
Preferably, in step b), each character in the first text is matched only against characters in the second text that have not yet been matched, and the matching character that appears earliest in the second text is recorded in the matching sequence of the second text.
Another object of the present invention is to provide a short text matching method with higher matching accuracy.
To achieve this object, the present invention provides another technical solution as follows:
A short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched, the matching method comprising: using the above method to calculate the text matching degree between the text to be matched and each short text in the set; and determining the short text in the set that has the highest text matching degree with the text to be matched to be the matching short text.
The method provided by the present invention calculates the matching degree between short texts more accurately; it achieves high matching accuracy, is robust, and has high sensitivity and specificity. The short text matching method provided by the present invention performs well both at matching positive examples and at rejecting negative examples, and therefore has higher matching accuracy than the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of the method for calculating the text matching degree between short texts provided by the first embodiment of the present invention.
Fig. 2 compares the technical indicators of the text similarity calculation method according to the present invention with those of prior-art text similarity calculation methods.
Detailed description
As shown in Fig. 1, the first embodiment of the present invention provides a method for calculating the text matching degree between short texts, which comprises the following steps.
Step S10: perform word segmentation on the first text and the second text to obtain the segmentation sequences of the first and second texts respectively.
As an example, word segmentation is performed separately on the two texts to be matched, A (the first text) and B (the second text). Because the texts are short, a segmentation method based on full segmentation and word-frequency statistics can be used. Segmentation yields the sequences A = "a_0 a_1 ... a_{m-1}" and B = "b_0 b_1 ... b_{n-1}", where m ≤ n; A denotes the shorter string and B the longer one.
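The text does not specify the segmentation algorithm beyond "full segmentation and word-frequency statistics". The sketch below shows one common reading of that phrase, offered only as an illustration: enumerate every dictionary word that occurs in the text, then pick the split with the highest total log-frequency by dynamic programming. The function name `segment`, the dictionary, and its counts are all hypothetical.

```python
import math

def segment(text, freq):
    """Split text into tokens, maximising the sum of log word frequencies.

    freq maps known words to positive counts (a hypothetical dictionary).
    Single characters that are not in freq are allowed with a count of 1.
    """
    total = sum(freq.values()) or 1
    max_len = max(map(len, freq), default=1)
    n = len(text)
    # best[end] = (best score of text[:end], start index of its last token)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq or len(word) == 1:
                score = best[start][0] + math.log(freq.get(word, 1) / total)
                if score > best[end][0]:
                    best[end] = (score, start)
    # Walk back through the stored start indices to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]

example_freq = {"union": 40, "pay": 30, "un": 5, "ion": 5}
print(segment("unionpay", example_freq))
```

The resulting tokens play the role of the "characters" a_i and b_k in the steps that follow.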
Step S11: based on the segmentation sequences of the first and second texts, determine the matching sequence of the first text and the matching sequence of the second text respectively.
Here, the matching sequence of the first text is the sequence formed by the characters in the first text that are identical to some character in the second text, arranged in the order in which those characters appear in the first text, and the matching sequence of the second text is the sequence formed by the characters in the second text that are identical to some character in the first text, arranged in the order in which those characters appear in the second text.
Continuing the example above, if the i-th character a_i of text A has a matching identical character b_k in text B, the position of that character in text B is calculated as:
C(A, i, B) = { k | b_k = a_i, k = 0, 1, ..., n-1 }, where i = 0, 1, ..., m-1.
Preferably, each character in the first text is matched only against characters in the second text that have not yet been matched, and the matching character that appears earliest in the second text is recorded in the matching sequence of the second text, thereby forming the complete matching sequence of the second text.
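A minimal sketch of this matching step, assuming the preferred rule described above (each character of A is paired with the earliest not-yet-matched identical character of B). The function names `match_positions` and `matching_sequences` are illustrative and do not come from the text.

```python
def match_positions(a, b):
    """Pair each element of a with the earliest unused equal element of b.

    Returns a list of (i, k) pairs meaning a[i] was matched to b[k].
    """
    used = [False] * len(b)
    pairs = []
    for i, token in enumerate(a):
        for k, other in enumerate(b):
            if not used[k] and other == token:
                used[k] = True
                pairs.append((i, k))
                break
    return pairs

def matching_sequences(a, b):
    """Return (ms1, ms2): the matched elements in A order and in B order."""
    pairs = match_positions(a, b)
    ms1 = [a[i] for i, _ in pairs]
    ms2 = [b[k] for _, k in sorted(pairs, key=lambda pair: pair[1])]
    return ms1, ms2

print(match_positions("abcd", "badce"))
print(matching_sequences("abcd", "badce"))
```

The second element of each pair corresponds to C(A, i, B) for the matched character; unmatched characters of A simply produce no pair.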
Step S12: traverse the matching sequence of the first text and determine the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text.
Step S13: calculate the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method.
Continuing the example above, if the i-th and (i+1)-th characters of text A each match a corresponding character in text B, the interval between the positions in text B of the characters corresponding to the successfully matched i-th and (i+1)-th characters is calculated as:
Δ(A, i+1, i, B) = C(A, i+1, B) - C(A, i, B)
The identical-character matching degree between the first and second texts can then be calculated using the following formula:
where N denotes the total number of characters successfully matched between text A and text B, and Δ(A, i+1, i, B) denotes the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.
Those skilled in the art will appreciate that taking the position intervals between characters into account improves the rejection of negative examples.
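The interval Δ can be illustrated with a short sketch. It covers only the interval computation defined above; the formula that turns the intervals into the matching degree is not reproduced in this text, so the sketch stops at Δ and N. The function name `position_intervals` and the sample positions are illustrative.

```python
def position_intervals(positions):
    """Return C(A, i+1, B) - C(A, i, B) for consecutive matched characters.

    positions lists, in A order, the index in B of each matched character.
    An interval is negative when two characters appear in B in the
    opposite order to A.
    """
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

# Hypothetical positions in B of four matched characters of A.
positions_in_b = [1, 0, 3, 2]
print(position_intervals(positions_in_b))
print(len(positions_in_b))  # N, the number of matched characters
```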
Step S14: calculate the edit distance between the matching sequence of the first text and the matching sequence of the second text.
Specifically, all successfully matched characters of text A and text B are arranged in the order in which they appear in A to form the string ms_1, and in the order in which they appear in B to form the string ms_2.
The edit distance (Levenshtein distance) between ms_1 and ms_2 can then be expressed as:
t = d(ms_1, ms_2), where d denotes the edit distance between two strings.
Those skilled in the art will appreciate that using the edit distance as a factor in the text matching degree preserves sensitivity to string length and character order, so that negative examples that differ greatly are rejected well.
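For reference, the Levenshtein distance d can be computed with the standard two-row dynamic programme. This is the textbook algorithm rather than anything specific to this document; the function name `levenshtein` is illustrative, and it accepts either strings or lists of tokens.

```python
def levenshtein(s, t):
    """Return the minimum number of insertions, deletions and
    substitutions needed to turn sequence s into sequence t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, item_s in enumerate(s, start=1):
        curr = [i]
        for j, item_t in enumerate(t, start=1):
            cost = 0 if item_s == item_t else 1
            curr.append(min(prev[j] + 1,          # delete item_s
                            curr[j - 1] + 1,      # insert item_t
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein(["a", "b"], ["b", "a"]))
```

Because ms_1 and ms_2 contain the same characters in possibly different orders, t here mainly measures how far the order of the matched characters differs between the two texts.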
Step S15: calculate the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
As an example, the text matching degree between the first and second texts is calculated as:
where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and |S_A| and |S_B| are the string lengths of the first and second texts respectively.
According to an improved version of the first embodiment, the method further comprises a text matching degree correction step:
defining a text similarity threshold;
determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and
correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
Those skilled in the art will appreciate that when the similarity between text A and text B is relatively high and the identical initial portion is relatively long, the probability that text A and text B are similar is higher. The correction step takes this factor into account.
Specifically, the text matching degree can be corrected using the following formula:
where d_AB is the text matching degree between the first and second texts, l is the character length of the identical initial portion determined in the correction step, and p is a correction factor that usually takes a value from 0 to 0.25, preferably 0.1.
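The correction formula itself is not reproduced in this text. The symbols described (a threshold, a common-prefix length l, and a factor p between 0 and 0.25 with 0.1 preferred) resemble the standard Jaro-Winkler prefix adjustment, so the sketch below uses that well-known form, d + l * p * (1 - d), purely as an assumption about what such a correction may look like. The threshold of 0.7 and the prefix cap of 4 are the conventional Jaro-Winkler values, not values taken from this document, and both function names are hypothetical.

```python
def common_prefix_length(ms1, ms2, max_len=4):
    """Length of the identical initial portion of two sequences,
    capped at max_len (assumed cap, as in standard Jaro-Winkler)."""
    length = 0
    for x, y in zip(ms1, ms2):
        if x != y or length == max_len:
            break
        length += 1
    return length

def corrected_degree(d_ab, prefix_len, p=0.1, threshold=0.7):
    """Jaro-Winkler-style adjustment (an assumed form): scores above the
    threshold are raised in proportion to the shared prefix length."""
    if d_ab <= threshold:
        return d_ab
    return d_ab + prefix_len * p * (1.0 - d_ab)

l = common_prefix_length("abcx", "abcy")
print(l, corrected_degree(0.8, l))
```

With the cap of 4 and p at most 0.25, the product l * p never exceeds 1, so the corrected value stays at or below 1.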
The second embodiment of the present invention provides a short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched. The matching method comprises:
1. Using the method for calculating the text matching degree between short texts provided by the first embodiment, calculate the text matching degree between the text to be matched and each short text in the set.
2. Determine the short text in the set that has the highest text matching degree with the text to be matched to be the short text that matches the text to be matched.
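The selection step can be sketched as follows. The scoring function is passed in as a parameter; `stand_in_score` is a simple character-set Jaccard overlap used only so that the example runs, and it is not the matching degree defined in the first embodiment. All names are illustrative.

```python
def best_matches(query, candidates, score):
    """Return every candidate whose score against query is the highest."""
    scored = [(score(query, candidate), candidate) for candidate in candidates]
    if not scored:
        return []
    top = max(value for value, _ in scored)
    return [candidate for value, candidate in scored if value == top]

def stand_in_score(a, b):
    """Placeholder similarity: Jaccard overlap of the character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

print(best_matches("union pay", ["unionpay", "visa", "mastercard"], stand_in_score))
```

Returning a list rather than a single item keeps ties, which fits the wording "one or more short texts".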
Fig. 2 compares the technical indicators of the text similarity calculation method according to the present invention (and the short text matching method based on it) with those of five prior-art text similarity calculation methods (and the short text matching methods based on them).
Specifically, to characterize the accuracy of the six calculation methods, their ROC curves are plotted, with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis.
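As an illustration of how such a curve and its AUC can be obtained from matching scores, the sketch below sweeps a decision threshold from high to low and integrates with the trapezoidal rule. The scores and labels are made-up sample data, not results from this document, and the function names are illustrative. It assumes at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points; labels are 1 (positive) or 0 (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Items sharing one score cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

sample_scores = [0.9, 0.8, 0.7, 0.6]
sample_labels = [1, 0, 1, 0]
print(auc(roc_points(sample_scores, sample_labels)))
```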
As can be seen from Fig. 2, the AUC of each of the six calculation methods is above 0.9, which shows that all six methods achieve fairly good text matching and/or text classification. In terms of AUC, the calculation method provided by the present invention performs best, the Jaro-Winkler method comes second, the edit distance method performs worst, and the LCS (longest common substring), cosine similarity, and phrase similarity methods fall in between. In terms of curve shape, the curve of the calculation method provided by the present invention also lies closer to the upper-left corner of the axes, which shows that the method has higher sensitivity and specificity in that region.
The text similarity calculation method provided by the present invention achieves better precision and recall at a given F1 threshold. When recall reaches 60%, its short text matching accuracy is better than that of the various prior-art methods and can reach 87.1%. In addition, in terms of AUC standard deviation, the text similarity calculation method provided by the present invention is better than the other methods overall.
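For readers unfamiliar with these metrics, the sketch below computes precision, recall, and F1 for a given decision threshold over matching scores. The data are made-up samples and the function name is illustrative; nothing here reproduces the reported figures.

```python
def precision_recall_f1(scores, labels, threshold):
    """Treat score >= threshold as a predicted match; labels are 1 or 0."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0], 0.5))
```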
The above description covers only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Those skilled in the art can make various modifications without departing from the spirit of the present invention and the appended claims.

Claims (7)

1. A method for calculating the text matching degree between short texts, comprising the following steps:
a) performing word segmentation on a first text and a second text to obtain the segmentation sequences of the first and second texts respectively;
b) determining, based on the segmentation sequences of the first and second texts, the matching sequence of the first text and the matching sequence of the second text respectively; wherein the matching sequence of the first text is the sequence formed by the characters in the first text that are identical to some character in the second text, arranged in the order in which those characters appear in the first text, and the matching sequence of the second text is the sequence formed by the characters in the second text that are identical to some character in the first text, arranged in the order in which those characters appear in the second text;
c) traversing the matching sequence of the first text and determining the position interval, in the second text, between the (i+1)-th character and the i-th character of the matching sequence of the first text;
d) calculating the identical-character matching degree between the first and second texts from the position intervals, using a phrase similarity calculation method;
e) calculating the edit distance between the matching sequence of the first text and the matching sequence of the second text; and
f) calculating the text matching degree between the first and second texts based on the identical-character matching degree between the first and second texts, the edit distance, and the respective string lengths of the first and second texts.
2. The method according to claim 1, characterized in that it further comprises a text matching degree correction step:
defining a text similarity threshold;
determining the character length of the initial portion of the matching sequence of the first text that is identical to the initial portion of the matching sequence of the second text; and
correcting the text matching degree based on the text similarity threshold and the character length of the identical initial portion.
3. The method according to claim 1, characterized in that, in step b):
each character in the first text is matched only against characters in the second text that have not yet been matched, and the matching character that appears earliest in the second text is recorded in the matching sequence of the second text.
4. The method according to claim 1, characterized in that, in step d),
the identical-character matching degree between the first and second texts is calculated as:
where N denotes the total number of characters successfully matched between the first text and the second text, and Δ(A, i+1, i, B) denotes the position interval, in the second text, between the characters corresponding to the (i+1)-th and i-th characters of the matching sequence of the first text.
5. The method according to claim 4, characterized in that, in step f),
the text matching degree between the first and second texts is calculated as:
where m is the identical-character matching degree between the first and second texts, t is the edit distance between the matching sequence of the first text and the matching sequence of the second text, and |S_A| and |S_B| are the string lengths of the first and second texts respectively.
6. The method according to claim 2, characterized in that the correction formula for the text matching degree is:
where d_AB is the text matching degree between the first and second texts, l is the character length of the identical initial portion, and p is a correction factor with a value from 0 to 0.25.
7. A short text matching method for finding, in a set of short texts, one or more short texts that match a text to be matched, the matching method comprising:
using the method according to any one of claims 1 to 6 to calculate the text matching degree between the text to be matched and each short text in the set;
determining the short text in the set that has the highest text matching degree with the text to be matched to be the matching short text.
CN201611256117.1A 2016-12-30 2016-12-30 Method for calculating text matching degree between short texts Active CN106980870B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611256117.1A CN106980870B (en) 2016-12-30 2016-12-30 Method for calculating text matching degree between short texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611256117.1A CN106980870B (en) 2016-12-30 2016-12-30 Method for calculating text matching degree between short texts

Publications (2)

Publication Number Publication Date
CN106980870A true CN106980870A (en) 2017-07-25
CN106980870B CN106980870B (en) 2020-07-28

Family

ID=59340951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611256117.1A Active CN106980870B (en) 2016-12-30 2016-12-30 Method for calculating text matching degree between short texts

Country Status (1)

Country Link
CN (1) CN106980870B (en)

Also Published As

Publication number Publication date
CN106980870B (en) 2020-07-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant