CN111159996B

CN111159996B - Short text set similarity comparison method and system based on text fingerprint algorithm

Info

Publication number: CN111159996B
Application number: CN201911401853.5A
Authority: CN
Inventors: 邱平
Original assignee: Fujian Funo Mobile Communication Technology Co ltd
Current assignee: Fujian Funo Mobile Communication Technology Co ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-03-24
Anticipated expiration: 2039-12-31
Also published as: CN111159996A

Abstract

The invention relates to a short text set similarity comparison method and system based on an improved text fingerprint algorithm. First, word segmentation is performed on each text to obtain a word set for each text; then stop words are filtered out of each word set; then a K value is set dynamically for each text, and K-Shingles are extracted from the stop-word-filtered word set to obtain a K-Shingle set for each text; finally, the similarity between two texts is calculated from their K-Shingle sets. The invention can improve the accuracy of similarity comparison for interface protocol texts.

Description

Short text set similarity comparison method and system based on text fingerprint algorithm

Technical Field

The invention relates to the technical field of computer text information processing, in particular to a short text set similarity comparison method and system based on an improved text fingerprint algorithm.

Background

In the internet era, the network is filled with a large amount of repeated content and information. Whether for deduplication and filtering in a search engine or for deduplication and anti-piracy on a media platform, similarity comparison over large volumes of text must be carried out efficiently and accurately.

A typical existing text deduplication method uses a fingerprint algorithm: the text is first segmented into words, the TF-IDF of the document is calculated, the words are sorted by TF-IDF, and the top-ranked words are extracted as feature words. A fingerprint is then constructed for each text, using a hash function or other rules, to serve as the identifier of the text, and the degree of duplication between texts is judged from the fingerprints.

Common existing text fingerprint algorithms include the following:

1. SimHash algorithm:

SimHash is an algorithm used by Google for deduplicating massive amounts of text and is based on LSH (locality-sensitive hashing). Locality-sensitive hashing produces similar hash values for similar strings, so similar items are more likely than dissimilar items to be hashed into the same bucket, and documents in the same bucket become candidate pairs. In this way, similarity judgment and deduplication can be solved in near-linear time. The SimHash algorithm computes a hash value for each feature (keyword) and finally combines these hash values into a single feature value, namely the fingerprint.
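The combination step described above can be sketched as follows. This is a minimal illustration rather than Google's implementation: the function names `simhash` and `hamming_distance`, the use of MD5 as the per-feature hash, the 64-bit width, and the equal weight given to every feature are all assumptions made for the example.

```python
import hashlib


def simhash(features, bits=64):
    """Combine per-feature hashes into one fingerprint by bitwise voting.

    Assumption: every feature has weight 1; real systems usually weight
    features, for example by TF-IDF.
    """
    votes = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +1 for that position, a cleared bit votes -1.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp_1 = simhash(["effective", "time"])
fp_2 = simhash(["time", "effective"])
```

Because the votes are summed per bit position, the order of the features does not affect the result, so `fp_1` and `fp_2` are identical. This shows concretely why a fingerprint of this kind cannot take word order into account.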

2. K-Shingle algorithm:

The core idea of K-Shingle is to convert the document similarity problem into a set similarity problem. A document can be regarded as a string; its k-shingles are all substrings of length k in the document, and any document can be represented as a set of k-shingles. For a text whose word segmentation vector is [w1, w2, w3, w4, …, wn] and k = 3, the shingle vector of the text is [(w1, w2, w3), (w2, w3, w4), (w3, w4, w5), …, (wn-2, wn-1, wn)]. The similarity (Jaccard coefficient) of the shingle vectors of two texts is then calculated to judge whether the texts are duplicates.

The SimHash algorithm is efficient and well suited to long texts, but it does not take deduplication granularity or word order into account, which can cause accuracy problems when high precision is required; in particular, its false-positive rate on short texts is high. The K-Shingle algorithm is more accurate, but it consumes more resources because its shingle vector space is very large (especially when K is large).
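A minimal sketch of the classical fixed-k variant, assuming the text has already been segmented into a list of words. The function names `k_shingles` and `jaccard` and the sample word lists are illustrative only.

```python
def k_shingles(words, k):
    """Return the set of all runs of k consecutive words."""
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


words_a = ["w1", "w2", "w3", "w4", "w5"]
words_b = ["w1", "w2", "w3", "w4", "w6"]

shingles_a = k_shingles(words_a, 3)
shingles_b = k_shingles(words_b, 3)
similarity = jaccard(shingles_a, shingles_b)  # 2 shared shingles out of 4 distinct
```

Note that a text with fewer than k words yields an empty shingle set, which is one reason a single fixed k is awkward for very short texts.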

Disclosure of Invention

In view of this, the present invention provides a method and a system for comparing similarity of short text sets based on an improved text fingerprint algorithm, which can improve the accuracy of comparing similarity of interface protocol texts.

The invention is realized by adopting the following scheme: a short text set similarity comparison method based on an improved text fingerprint algorithm specifically comprises the following steps:

performing word segmentation processing on each text to obtain a word set of each text;

filtering stop words of the word set of each text;

dynamically setting a K value for each text, and extracting K-Shingles from the stop-word-filtered word set to obtain a K-Shingle set of each text;

and calculating the similarity between two texts according to the K-Shingle set of each text.

Further, the word segmentation processing on each text to obtain a word set of each text specifically includes: and taking the Chinese word as the minimum word segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text.

Further, dynamically setting the K value and extracting K-Shingles from the stop-word-filtered word set specifically includes: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all results into one set, which is the K-Shingle set of the text.

Further, the calculating the similarity between two texts according to the K-Shingle set of each text specifically includes the following steps:

step S1: forming a phrase library of size N from the distinct K-Shingles in the K-Shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein when the nth K-Shingle in the phrase library appears in a document, the nth element of the feature vector of the document is 1, and otherwise it is 0;

step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.

The invention also provides a short text set similarity comparison system based on an improved text fingerprinting algorithm, comprising a memory, a processor and a computer program stored on the memory and capable of running on the processor, which when executed by the processor implements the method steps as described above.

The invention also provides a computer-readable storage medium, on which a computer program is stored which can be executed by a processor, which, when executing the computer program, carries out the method steps as described above.

Compared with the prior art, the invention has the following beneficial effects: the invention does not adopt a fixed K value but lets K take every value from 1 to M (M is the total number of words of the text), thereby avoiding the problem of choosing a K value. Meanwhile, provided the number of texts is controllable, the accuracy of similarity comparison can be improved.

Drawings

FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention.

Detailed Description

The invention is further explained by the following embodiments in conjunction with the drawings.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The method mainly addresses the problem of how to select a suitable text fingerprint algorithm, given the particular nature of interface protocol text, for interface protocol deduplication comparison or similar specific usage scenarios, so as to achieve efficient and accurate interface protocol similarity comparison. An interface protocol text is a set of short texts, and each short text is generally no longer than 10 characters. These text characteristics must be taken into account, and the existing algorithms must be improved to obtain one that suits them.

As shown in fig. 1, the present embodiment provides a short text set similarity comparison method based on an improved text fingerprint algorithm, which specifically includes the following steps:

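performing word segmentation processing on each text to obtain a word set of each text;

filtering stop words of the word set of each text;

dynamically setting a K value for each text, and extracting K-Shingles from the stop-word-filtered word set to obtain a K-Shingle set of each text;

and calculating the similarity between two texts according to the K-Shingle set of each text.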

In this embodiment, the performing word segmentation processing on each text to obtain a word set of each text specifically includes: taking the Chinese word as the minimum word segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text. For example, suppose the original interface protocol text set is:

{ { mobile phone number of user }, { home city }, { product code }, { acceptance type }, { effective time }, { expiration time }, { whether to send short message notification } }

The result after the word segmentation processing is:

{ { user, of, mobile phone, number }, { home, city }, { product, code }, { acceptance, type }, { effective, time }, { expiration, time }, { whether, send, short message, notification } }.

The preprocessing specifically includes: reading the document data set, and cleaning and normalizing, as required, the punctuation, whitespace, special characters, Chinese and English characters, simplified and traditional characters, and other characters in the documents.

Stop-word filtering is then applied to the segmented words to remove words that carry no meaning on their own, such as function words and non-entity words. The result after filtering is:

{ { user, mobile phone, number }, { home, city }, { product, code }, { acceptance, type }, { effective, time }, { expiration, time }, { whether, send, short message, notification } }.
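The filtering step can be sketched as below. The stop-word list `STOP_WORDS`, the function name `filter_stop_words`, and the English token "of" standing in for a function word are assumptions for illustration; the patent does not specify the stop-word list or the segmenter.

```python
# Hypothetical stop-word list; a real deployment would load a curated list.
STOP_WORDS = {"of", "the", "a"}


def filter_stop_words(word_sets, stop_words=STOP_WORDS):
    """Drop stop words from each segmented text, keeping word order."""
    return [[w for w in words if w not in stop_words] for words in word_sets]


segmented = [
    ["user", "of", "mobile phone", "number"],
    ["home", "city"],
]
filtered = filter_stop_words(segmented)
```

Word order is preserved on purpose, because the later shingle extraction depends on which words are adjacent after filtering.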

A K-Shingle is any sequence of K words that appear consecutively in a document. The only hyperparameter that the K-Shingle algorithm requires is K, the number of consecutive words contained in a shingle. In the classical K-Shingle algorithm, K is generally a fixed constant, and this parameter has two main effects:

(1) The ability of shingles to capture the semantic features of documents.

The smaller K is (for example K = 1), the more likely it is that the same K-word sequence recurs across texts, the weaker the ability of the K-Shingles to capture document semantics, and the more the calculation tends to judge texts as highly similar. Conversely, as K increases, combinations of different words reflect more and more semantic-level features, and the less readily the calculation judges texts as highly similar.

(2) The impact of shingles on storage space and computational efficiency.

In short, the larger K is, the higher the dimension of the feature vector, the larger the required storage space, and the lower the efficiency of computing distances between feature vectors.

Therefore, for the classical K-Shingle algorithm, the choice of K is very important. It depends on the typical length of the documents and the typical size of the character set. Generally, the shorter the text, the smaller K can be, and the longer the text, the larger K should be. Empirically, K = 3 to 5 gives good results for short texts and K = 10 for long texts.

This embodiment targets usage scenarios such as interface protocol similarity comparison and takes into account the particularities of interface protocols: (1) The number of interface protocols is limited and generally does not exceed 10,000 for a single capability open platform. (2) The text of each interface protocol field is short, mostly within 10 characters. Based on these two preconditions, and in order to ensure both accuracy and computational efficiency, this embodiment does not use a Shingle algorithm with a fixed K value but a dynamic K value, as follows:

In this embodiment, dynamically setting the K value and extracting K-Shingles from the stop-word-filtered word set specifically includes: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all results into one set, which is the K-Shingle set of the text. A specific example is as follows:

For { { user, mobile phone, number }, { home, city }, { product, code }, { acceptance, type }, { effective, time }, { expiration, time }, { whether, send, short message, notification } }, the K-Shingle result is:

{ user, mobile phone, number, user mobile phone, mobile phone number, user mobile phone number, home, city, home city, product, code, product code, acceptance, type, acceptance type, effective, time, effective time, expiration, expiration time, whether, send, short message, notification, whether to send, send short message, short message notification, whether to send short message, send short message notification, whether to send short message notification }.
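The dynamic-K extraction can be sketched as follows, assuming each field has already been segmented and filtered into a list of words. The function names `dynamic_k_shingles` and `protocol_shingles` are hypothetical, and merging the shingles of all fields into one set per protocol is an interpretation of the example above rather than something the text states explicitly.

```python
def dynamic_k_shingles(words):
    """Collect the K-shingles of one field for every K from 1 to M = len(words)."""
    m = len(words)
    return {
        tuple(words[i:i + k])      # tuples keep word boundaries unambiguous
        for k in range(1, m + 1)   # K = 1 .. M
        for i in range(m - k + 1)  # every start position for this K
    }


def protocol_shingles(fields):
    """Union of the dynamic-K shingles of all fields of one interface protocol."""
    shingles = set()
    for words in fields:
        shingles |= dynamic_k_shingles(words)
    return shingles


field = dynamic_k_shingles(["user", "mobile phone", "number"])
protocol = protocol_shingles([["effective", "time"], ["expiration", "time"]])
```

For the three-word field this yields six shingles (three of length 1, two of length 2, one of length 3). In the two-field protocol the word "time" occurs in both fields but is stored once, since the result is a set.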

In this embodiment, the calculating the similarity between two texts according to the K-Shingle set of each text specifically includes the following steps:

step S1: forming a phrase library of size N from the distinct K-Shingles in the K-Shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein when the nth K-Shingle in the phrase library appears in a document, the nth element of the feature vector of the document is 1, and otherwise it is 0;

step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.

For a set A and a set B, the Jaccard similarity is defined as the proportion of intersection elements among the union elements, that is,

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Clearly, the larger this ratio, the more similar the two sets. For the feature vectors of two documents, the numerator is the number of positions at which both vectors have a 1, and the denominator is the number of positions at which at least one of the two vectors has a 1.
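Steps S1 and S2 can be sketched as below. The function names, the sorted ordering of the phrase library, the threshold value 0.8, and the choice of "greater than or equal" for the comparison are assumptions for illustration; the text only says that a preset threshold is used. Shingles are written as plain strings here for brevity.

```python
def build_phrase_library(shingle_sets):
    """Step S1a: distinct shingles across all texts, in a fixed (sorted) order."""
    return sorted(set().union(*shingle_sets))


def one_hot(shingles, library):
    """Step S1b: length-N vector with 1 where the nth library shingle occurs."""
    return [1 if s in shingles else 0 for s in library]


def jaccard_vectors(u, v):
    """Step S2: positions where both are 1 over positions where at least one is 1."""
    both = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    return both / either if either else 0.0


def is_similar(u, v, threshold=0.8):
    """Compare against a preset threshold (0.8 is a hypothetical value)."""
    return jaccard_vectors(u, v) >= threshold


text_a = {"effective", "time", "effective time"}
text_b = {"expiration", "time", "expiration time"}

library = build_phrase_library([text_a, text_b])
vec_a = one_hot(text_a, library)
vec_b = one_hot(text_b, library)
similarity = jaccard_vectors(vec_a, vec_b)
```

Here the two texts share only the shingle "time" out of five distinct shingles, so the similarity is 1/5. The vector form gives the same value as computing the Jaccard coefficient directly on the two shingle sets.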

The embodiment also provides a short text set similarity comparison system based on an improved text fingerprint algorithm, which comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the computer program is executed by the processor to implement the method steps as described above.

The present embodiment also provides a computer-readable storage medium having stored thereon a computer program capable of being executed by a processor, which when executing the computer program performs the method steps as described above.

In application scenarios such as interface protocol comparison, the length of each text field may vary from one word to tens of words, so no single effective K value can be selected if the conventional K-Shingle algorithm is used. This embodiment does not adopt a fixed K value but lets K take every value from 1 to M (M is the total number of words of the text), thereby avoiding the problem of choosing a K value. Meanwhile, provided the number of texts is controllable, the accuracy of similarity comparison can be improved. The method of this embodiment is therefore well suited to the interface protocol scenario.
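As a rough check on why this stays tractable for short fields, the number of shingles produced for one field can be bounded as follows. This is a worked estimate, not a figure from the patent; it assumes a field of M words, with the symbol S (introduced here) denoting its dynamic-K shingle set.

```latex
% For each K there are M - K + 1 start positions, so
|S| \le \sum_{K=1}^{M} (M - K + 1) = \frac{M(M+1)}{2}
% with equality when all extracted shingles are distinct.
% Example: M = 10 gives at most 10 * 11 / 2 = 55 shingles.
```

The count grows quadratically in M, which is acceptable when fields are limited to a few words but would become costly for long texts.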

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims

1. A short text set similarity comparison method based on an improved text fingerprint algorithm is characterized by comprising the following steps:

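performing word segmentation processing on each text to obtain a word set of each text;

filtering stop words of the word set of each text;

dynamically setting a K value for each text, and extracting K-Shingles from the stop-word-filtered word set to obtain a K-Shingle set of each text;

calculating the similarity between two texts according to the K-Shingle set of each text;

wherein dynamically setting the K value and extracting K-Shingles from the stop-word-filtered word set specifically comprises: letting the number of words in the word set be M, performing shingle extraction for every K from K = 1 to K = M, and combining all results into one set, which is the K-Shingle set of the text;

K refers to the number of consecutive words contained in a shingle;

calculating the similarity between two texts according to the K-Shingle set of each text specifically comprises the following steps:

step S1: forming a phrase library of size N from the distinct K-Shingles in the K-Shingle sets of all texts; encoding each text in a one-hot manner to obtain a feature vector of length N for each text, wherein when the nth K-Shingle in the phrase library appears in a document, the nth element of the feature vector of the document is 1, and otherwise it is 0;

step S2: calculating the Jaccard similarity between the feature vectors of the two texts, and comparing it with a preset similarity threshold to judge whether the two texts are similar.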

2. The method for comparing similarity of short text sets based on the improved text fingerprint algorithm according to claim 1, wherein the step of performing word segmentation processing on each text to obtain the word set of each text specifically comprises the steps of: and taking the Chinese word as the minimum word segmentation unit, and performing word segmentation on each text in the preprocessed short text set to obtain a word set of each text.

3. A short text set similarity comparison system based on an improved text fingerprinting algorithm, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the method according to any of claims 1-2.

4. A computer-readable storage medium, on which a computer program is stored which can be executed by a processor, characterized in that the processor, when executing the computer program, implements the method according to any of claims 1-2.