CN111159996B - Short text set similarity comparison method and system based on text fingerprint algorithm - Google Patents


Info

Publication number
CN111159996B
CN111159996B CN201911401853.5A CN201911401853A CN111159996B CN 111159996 B CN111159996 B CN 111159996B CN 201911401853 A CN201911401853 A CN 201911401853A CN 111159996 B CN111159996 B CN 111159996B
Authority
CN
China
Prior art keywords
text
shift
similarity
word
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911401853.5A
Other languages
Chinese (zh)
Other versions
CN111159996A (en)
Inventor
邱平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Funo Mobile Communication Technology Co ltd
Original Assignee
Fujian Funo Mobile Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Funo Mobile Communication Technology Co ltd
Priority to CN201911401853.5A
Publication of CN111159996A
Application granted
Publication of CN111159996B
Active
Anticipated expiration

Abstract

The invention relates to a short text set similarity comparison method and system based on an improved text fingerprint algorithm. Word segmentation is first performed on each text to obtain its word set; stop words are then filtered out of each word set; a K value is then set dynamically for each text, and K-shingles are extracted from the stop-word-filtered word set to obtain the K-shingle set of each text; finally, the similarity between two texts is calculated from their K-shingle sets. The invention can improve the accuracy of similarity comparison for interface protocol texts.

Description

Short text set similarity comparison method and system based on text fingerprint algorithm
Technical Field
The invention relates to the technical field of computer text information processing, in particular to a short text set similarity comparison method and system based on an improved text fingerprint algorithm.
Background
In the internet era, the network is flooded with duplicated content and information. Whether for deduplication and filtering in search engines or for deduplication and anti-piracy on media platforms, similarity comparison must be carried out efficiently and accurately over large volumes of text.
A typical existing text deduplication method uses a fingerprint algorithm: the text is first segmented into words, the TF-IDF of the document is calculated, the words are sorted by TF-IDF and the top-ranked ones are extracted as feature words, a fingerprint is then constructed for each text with a hash function or other rules to serve as the text's identifier, and the degree of duplication is judged from the fingerprints.
Common existing text fingerprint algorithms include the following:
1. Simhash algorithm:
Simhash is an algorithm used by Google for deduplicating massive amounts of text, and is a form of LSH (locality-sensitive hashing). Locality-sensitive hashing yields similar hash values for similar strings, so similar items are more likely than dissimilar items to be hashed into the same bucket, and documents that fall into the same bucket become candidate pairs. Similarity judgment and deduplication can therefore be solved in close to linear time. The Simhash algorithm calculates a hash value for each feature (keyword) and finally combines the hash values into one feature value, namely the fingerprint.
2. K-Shingle algorithm:
The core idea of K-Shingle is to convert the document similarity problem into a set similarity problem. A document can be regarded as a character string; the K-shingles of the document are all substrings of length K in it, and any document can be represented as a set of K-shingles. For a piece of text whose word segmentation vector is [w1, w2, w3, w4, …, wn], with K = 3, the shingle vector of the text is [(w1, w2, w3), (w2, w3, w4), (w3, w4, w5), …, (wn-2, wn-1, wn)]. The similarity (Jaccard coefficient) of the shingle vectors of two texts is then calculated to judge whether the texts are duplicates.
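As a minimal Python sketch of the fixed-K scheme just described, shingles can be built as tuples of K consecutive words and compared with the Jaccard coefficient. The function names `k_shingles` and `jaccard` and the sample word lists are illustrative assumptions, not taken from the patent.

```python
def k_shingles(words, k):
    """Return the set of all runs of k consecutive words (as tuples)."""
    # If the text has fewer than k words, the range is empty and so is the set.
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    union = a | b
    # Assumption: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical segmented texts that differ only in the last word.
doc1 = ["w1", "w2", "w3", "w4", "w5"]
doc2 = ["w1", "w2", "w3", "w4", "w6"]

s1 = k_shingles(doc1, 3)
s2 = k_shingles(doc2, 3)
print(jaccard(s1, s2))  # 2 shared shingles out of 4 distinct ones -> 0.5
```

Because a single changed word alters every shingle that overlaps it, the score drops faster as K grows, which is the sensitivity to K discussed later in the description.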
The Simhash algorithm is relatively efficient and better suited to long texts, but it does not consider deduplication granularity or word order, so it may run into accuracy problems when high precision is required, and in particular it has a high false positive rate on short texts. The K-Shingle algorithm is more accurate, but because its shingle vector space is huge (especially when K is large), it consumes more resources.
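The word-order limitation of Simhash can be illustrated with a small sketch. The code below is an unweighted, simplified variant written for illustration only: the use of MD5 as the per-feature hash, the 64-bit width, and the function names are assumptions, and practical implementations normally weight each feature (for example by TF-IDF).

```python
import hashlib


def simhash(features, bits=64):
    """Simplified, unweighted Simhash over a list of string features."""
    votes = [0] * bits
    for feature in features:
        # Assumption: MD5 is used only as a convenient, stable hash function.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = simhash(["user", "mobile", "number"])
b = simhash(["number", "mobile", "user"])
print(hamming(a, b))  # 0: the votes are summed, so word order is lost
```

Since the per-bit votes are simply added up, any reordering of the same features yields the same fingerprint, whereas shingles of length two or more do change when the order changes.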
Disclosure of Invention
In view of this, the present invention provides a method and a system for comparing similarity of short text sets based on an improved text fingerprint algorithm, which can improve the accuracy of comparing similarity of interface protocol texts.
The invention is realized by adopting the following scheme: a short text set similarity comparison method based on an improved text fingerprint algorithm specifically comprises the following steps:
performing word segmentation processing on each text to obtain a word set of each text;
filtering stop words of the word set of each text;
dynamically setting a K value for each text, and extracting K-shingles from the stop-word-filtered word set to obtain a K-shingle set of each text;
and calculating the similarity between two texts according to the K-shingle set of each text.
Further, performing word segmentation processing on each text to obtain a word set of each text specifically includes: taking the Chinese word as the minimum segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text.
Further, dynamically setting the K value and extracting K-shingles from the stop-word-filtered word set specifically means: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all the results into one set, which is the K-shingle set of the text.
Further, calculating the similarity between two texts according to the K-shingle set of each text specifically includes the following steps:
step S1: forming a phrase library of size N from the distinct K-shingles in the K-shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein the nth element of a document's feature vector is 1 when the nth K-shingle in the phrase library appears in the document, and 0 otherwise;
step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.
The invention also provides a short text set similarity comparison system based on an improved text fingerprinting algorithm, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the method steps as described above.
The invention also provides a computer-readable storage medium, on which a computer program is stored which can be executed by a processor, which, when executing the computer program, carries out the method steps as described above.
Compared with the prior art, the invention has the following beneficial effects: instead of using a fixed K value, the invention takes K over 1…M (where M is the total number of words in the text), thereby avoiding the problem of choosing a K value. Meanwhile, provided that the number of texts is controllable, the accuracy of similarity comparison can be improved.
Drawings
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The method mainly addresses how to select a suitable text fingerprint algorithm, given the particular nature of interface protocol text, for interface protocol deduplication and comparison or similar specific usage scenarios, so as to achieve efficient and accurate interface protocol similarity comparison. An interface protocol text is a set of short texts, each generally no longer than 10 characters, so these text characteristics must be taken into account and the existing algorithms improved to arrive at a suitable one.
As shown in fig. 1, the present embodiment provides a short text set similarity comparison method based on an improved text fingerprint algorithm, which specifically includes the following steps:
performing word segmentation processing on each text to obtain a word set of each text;
filtering stop words of the word set of each text;
dynamically setting a K value for each text, and extracting K-shingles from the stop-word-filtered word set to obtain a K-shingle set of each text;
and calculating the similarity between two texts according to the K-shingle set of each text.
In this embodiment, performing word segmentation processing on each text to obtain a word set of each text specifically includes: taking the Chinese word as the minimum segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text. For example, suppose the original interface protocol text set is:
{ {user mobile phone number}, {home city}, {product code}, {acceptance type}, {validation time}, {expiration time}, {whether to send short message notification} }
The result after word segmentation is:
{ {user, mobile phone, number}, {home, city}, {product, code}, {acceptance, type}, {validation, time}, {expiration, time}, {whether, send, short message, notification} }
The preprocessing specifically consists of reading the document data set and, as required, cleaning and normalizing punctuation, whitespace, special characters, Chinese and English text, simplified and traditional characters, and other characters in the documents.
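The patent does not specify how the cleaning is implemented. One possible sketch using only the Python standard library is shown below; the function name `clean_text` and the particular normalization choices (NFKC folding, replacing punctuation with spaces, lower-casing) are assumptions. Conversion between simplified and traditional characters is not covered here because it needs an external mapping table.

```python
import re
import unicodedata


def clean_text(text):
    """Normalize one raw field before word segmentation (illustrative only)."""
    # Fold full-width letters, digits and punctuation to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Make the English parts case-insensitive.
    return text.lower()


print(clean_text("  User-ID：  ＡＢＣ\t123! "))  # user id abc 123
```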
The segmented words are then filtered to remove stop words, that is, words that carry no meaning of their own, such as function words and non-entity words. The result after filtering is:
{ {user, mobile phone, number}, {home, city}, {product, code}, {acceptance, type}, {validation, time}, {expiration, time}, {whether, send, short message, notification} }
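Stop-word filtering itself is a simple membership test. In the sketch below the stop-word list and the sample input are hypothetical; the patent does not say which list is used.

```python
# Hypothetical stop-word list; a real deployment would load a curated one.
STOP_WORDS = {"的", "了", "和"}


def filter_stop_words(words, stop_words=STOP_WORDS):
    """Drop stop words while keeping the remaining words in order."""
    return [w for w in words if w not in stop_words]


print(filter_stop_words(["用户", "的", "手机", "号码"]))  # ['用户', '手机', '号码']
```

Word order is preserved on purpose, because the shingle extraction that follows depends on which words are adjacent.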
A K-shingle is any run of K consecutive characters in a document. The only hyperparameter the K-shingle algorithm requires is K, the number of consecutive words contained in a shingle. In the classic K-shingle algorithm K is generally a fixed constant, and this parameter has two main effects:
(1) How well shingles capture the semantic features of a document.
The smaller K is (for example K = 1), the higher the probability that a length-K sequence recurs across texts, the weaker the shingles' ability to capture document semantics, and the more readily the calculation judges texts to be highly similar. Conversely, as K increases, combinations of different words reflect more and more semantic-level features, and it becomes harder for the calculation to judge the similarity as high.
(2) The impact of shingles on storage space and computational efficiency.
In short, the larger K is, the higher the dimension of the feature vector, the larger the storage space required, and the lower the efficiency of calculating the distance between feature vectors.
Therefore, for the classic K-shingle algorithm the choice of K is very important. It depends on the typical document length and the typical size of the character set: in general, the shorter the text, the smaller K can be, and the longer the text, the larger K should be. Empirically, K = 3 to 5 works well for short texts and K = 10 for long texts.
This embodiment is limited to usage scenarios such as interface protocol similarity comparison, and takes the particular nature of interface protocols into account: (1) the number of interface protocols is limited, and for a single capability open platform generally does not exceed 10,000; (2) the text length of each interface protocol field is limited, mostly within 10 characters. Given these two preconditions, and to ensure both accuracy and computational efficiency, this embodiment does not use a shingle algorithm with a fixed K value but uses a dynamic K value instead, as follows:
In this embodiment, dynamically setting the K value and extracting K-shingles from the stop-word-filtered word set specifically means: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all the results into one set, which is the K-shingle set of the text. A specific example follows:
For { {user mobile phone number}, {home city}, {product code}, {acceptance type}, {validation time}, {expiration time}, {whether to send short message notification} }, the K-shingle result is:
{ user, mobile phone, number, user mobile phone, mobile phone number, user mobile phone number, home, city, home city, product, code, product code, acceptance, type, acceptance type, validation, time, validation time, expiration time, whether, send, short message, notification, whether to send, send short message, short message notification, whether to send short message notification }.
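The dynamic-K extraction can be sketched in a few lines of Python. The function name `dynamic_k_shingles` is an assumption, and shingles are represented as tuples of words so that multi-word tokens such as "mobile phone" stay unambiguous; the patent does not prescribe a representation.

```python
def dynamic_k_shingles(words):
    """Union of the K-shingles of `words` for every K from 1 to M = len(words)."""
    m = len(words)
    return {
        tuple(words[i:i + k])
        for k in range(1, m + 1)      # K = 1 .. M
        for i in range(m - k + 1)     # every start position for this K
    }


# One already segmented and filtered field from the example above.
field = ["user", "mobile phone", "number"]
for shingle in sorted(dynamic_k_shingles(field), key=lambda s: (len(s), s)):
    print(" ".join(shingle))
```

For this three-word field the sketch yields six shingles: the three single words, the two adjacent pairs, and the full phrase. The shingle set of a whole protocol is then the union over its fields.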
In this embodiment, calculating the similarity between two texts according to the K-shingle set of each text specifically includes the following steps:
step S1: forming a phrase library of size N from the distinct K-shingles in the K-shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein the nth element of a document's feature vector is 1 when the nth K-shingle in the phrase library appears in the document, and 0 otherwise;
step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.
For a set A and a set B, the Jaccard similarity is defined as the proportion of the union's elements that are in the intersection, that is,
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Clearly, the larger this ratio, the more similar the sets. For the feature vectors of two documents, the numerator is the number of positions at which both vectors have a 1, and the denominator is the number of positions at which at least one vector has a 1.
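Steps S1 and S2 can be sketched as follows. The function names, the sample shingle sets, the default threshold of 0.5, and the use of "greater than or equal to" for the threshold comparison are all illustrative assumptions; the patent only states that a preset threshold is used.

```python
def build_vocabulary(shingle_sets):
    """Step S1a: phrase library of the distinct shingles across all texts."""
    return sorted(set().union(*shingle_sets))


def one_hot(shingles, vocabulary):
    """Step S1b: length-N 0/1 vector marking which library entries occur."""
    return [1 if s in shingles else 0 for s in vocabulary]


def jaccard_vectors(u, v):
    """Step S2: positions where both are 1, over positions where either is 1."""
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    return both / either if either else 1.0


def is_similar(u, v, threshold=0.5):
    """Compare the similarity with a preset (here hypothetical) threshold."""
    return jaccard_vectors(u, v) >= threshold


a = {"user", "mobile phone", "user mobile phone"}
b = {"user", "number", "user number"}
vocab = build_vocabulary([a, b])
va, vb = one_hot(a, vocab), one_hot(b, vocab)
print(va, vb, jaccard_vectors(va, vb))
```

On 0/1 vectors built from a shared phrase library, this gives the same value as computing the Jaccard coefficient directly on the two shingle sets.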
The embodiment also provides a short text set similarity comparison system based on an improved text fingerprint algorithm, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the computer program is executed by the processor to implement the method steps as described above.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program capable of being executed by a processor, which when executing the computer program performs the method steps as described above.
In application scenarios such as interface protocol comparison, the length of each text field may vary from one word to tens of words, so no single effective K value can be chosen if the conventional K-shingle algorithm is used. This embodiment does not use a fixed K value but takes K over 1…M (where M is the total number of words in the text), thereby avoiding the problem of choosing a K value. Meanwhile, provided that the number of texts is controllable, the accuracy of similarity comparison can be improved. The method of this embodiment is therefore well suited to the interface protocol scenario.
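A short count, offered as a back-of-the-envelope check rather than a statement from the patent, suggests why taking every K stays affordable for short fields. A text of M words has M - K + 1 shingles of length K, so over all K from 1 to M the number of shingles, before duplicates are removed, is:

```latex
\[
\sum_{K=1}^{M} (M - K + 1) \;=\; \frac{M(M+1)}{2}
\]
```

Assuming a field of at most 10 words, which follows when the field has at most 10 characters, this is at most 55 shingles per field. A fixed K yields only M - K + 1 shingles, so the dynamic scheme grows quadratically in M and relies on the texts being short.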
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (4)

1. A short text set similarity comparison method based on an improved text fingerprint algorithm is characterized by comprising the following steps:
performing word segmentation processing on each text to obtain a word set of each text;
filtering stop words of the word set of each text;
dynamically setting a K value for each text, and extracting K-shingles from the stop-word-filtered word set to obtain a K-shingle set of each text;
calculating the similarity between two texts according to the K-shingle set of each text;
wherein dynamically setting the K value and extracting K-shingles from the stop-word-filtered word set specifically comprises: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all the results into one set, which is the K-shingle set of the text;
K refers to the number of consecutive words contained in a shingle;
calculating the similarity between two texts according to the K-shingle set of each text specifically comprises the following steps:
step S1: forming a phrase library of size N from the distinct K-shingles in the K-shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein the nth element of a document's feature vector is 1 when the nth K-shingle in the phrase library appears in the document, and 0 otherwise;
step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.
2. The short text set similarity comparison method based on the improved text fingerprint algorithm according to claim 1, wherein performing word segmentation processing on each text to obtain the word set of each text specifically comprises: taking the Chinese word as the minimum segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text.
3. A short text set similarity comparison system based on an improved text fingerprinting algorithm, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the method according to any of claims 1-2.
4. A computer-readable storage medium, on which a computer program is stored which can be executed by a processor, characterized in that the processor, when executing the computer program, implements the method according to any of claims 1-2.
CN201911401853.5A 2019-12-31 2019-12-31 Short text set similarity comparison method and system based on text fingerprint algorithm Active CN111159996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911401853.5A CN111159996B (en) 2019-12-31 2019-12-31 Short text set similarity comparison method and system based on text fingerprint algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911401853.5A CN111159996B (en) 2019-12-31 2019-12-31 Short text set similarity comparison method and system based on text fingerprint algorithm

Publications (2)

Publication Number Publication Date
CN111159996A CN111159996A (en) 2020-05-15
CN111159996B CN111159996B (en) 2023-03-24

Family

ID=70559298

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911401853.5A Active CN111159996B (en) 2019-12-31 2019-12-31 Short text set similarity comparison method and system based on text fingerprint algorithm

Country Status (1)

Country Link
CN (1) CN111159996B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8122032B2 (en) * 2007-07-20 2012-02-21 Google Inc. Identifying and linking similar passages in a digital text corpus
CN102024065B (en) * 2011-01-18 2013-01-02 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method
CN103646029B (en) * 2013-11-04 2017-03-15 北京中搜网络技术股份有限公司 A kind of similarity calculating method for blog article

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Analysis of duplicate web page detection techniques for search engines;is_153723;《https://m.ishare.iask.sina.com.cn/f/iU7sqaxm3z.html》;20171027;pages 1-11 *
Text deduplication algorithms: Minhash/Simhash/Klongsent;剪水作花飞;《https://zhuanlan.zhihu.com/p/43640234》;20190218;pages 1-6 *

Also Published As

Publication number Publication date
CN111159996A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN109241274B (en) Text clustering method and device
CN107229668B (en) Text extraction method based on keyword matching
WO2016180268A1 (en) Text aggregate method and device
CN111460153B (en) Hot topic extraction method, device, terminal equipment and storage medium
US8577155B2 (en) System and method for duplicate text recognition
CN107463548B (en) Phrase mining method and device
JP2018518788A (en) Web page training method and apparatus, search intention identification method and apparatus
CN108710611B (en) Short text topic model generation method based on word network and word vector
CN110413787B (en) Text clustering method, device, terminal and storage medium
WO2023029356A1 (en) Sentence embedding generation method and apparatus based on sentence embedding model, and computer device
CN109993216B (en) Text classification method and device based on K nearest neighbor KNN
CN111291177A (en) Information processing method and device and computer storage medium
CN112883734B (en) Block chain security event public opinion monitoring method and system
CN106569989A (en) De-weighting method and apparatus for short text
CN110929145A (en) Public opinion analysis method, public opinion analysis device, computer device and storage medium
CN108875050B (en) Text-oriented digital evidence-obtaining analysis method and device and computer readable medium
CN110489759B (en) Text feature weighting and short text similarity calculation method, system and medium based on word frequency
CN110399464B (en) Similar news judgment method and system and electronic equipment
CN111159996B (en) Short text set similarity comparison method and system based on text fingerprint algorithm
CN116561298A (en) Title generation method, device, equipment and storage medium based on artificial intelligence
CN114792092B (en) Text theme extraction method and device based on semantic enhancement
Zhang et al. Effective and fast near duplicate detection via signature-based compression metrics
CN111310176B (en) Intrusion detection method and device based on feature selection
CN113536779B (en) Trending topic data processing method and device based on document titles and electronic equipment
CN116136866B (en) Knowledge graph-based correction method and device for Chinese news abstract factual knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant