CN113609247A - Big data text duplicate removal technology based on improved Simhash algorithm - Google Patents


Info

Publication number
CN113609247A
CN113609247A
Authority
CN
China
Prior art keywords
signature
word segmentation
big data
algorithm
data text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110917142.4A
Other languages
Chinese (zh)
Inventor
梁超
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110917142.4A priority Critical patent/CN113609247A/en
Publication of CN113609247A publication Critical patent/CN113609247A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a big data text deduplication technique based on an improved Simhash algorithm, relating to the field of natural language processing and comprising the following steps: (1) performing word segmentation with a word segmentation tool; (2) assigning corresponding weights to the segmented keywords; (3) calculating a document content signature and an article summary signature from the keyword weights; (4) computing and finding similar documents. Building on the classic Simhash algorithm, the invention provides an improved Simhash algorithm for big data text deduplication. First, a better word segmentation tool is selected so that segmentation is more accurate; part of speech and word length are considered in the weight calculation stage; and a secondary hash based on the idea of bucket sorting is applied in the signature matching stage. Finally, a brand-new formula is provided for comparing the Hamming distances of Simhash signatures computed from the feature vectors of the article content and the summary content. The method is well suited to big data text deduplication, improving accuracy and recall as well as deduplication speed.

Description

Big data text duplicate removal technology based on improved Simhash algorithm
Technical Field
The invention discloses a big data text duplicate removal technology based on an improved Simhash algorithm, and relates to the field of natural language processing.
Background
Since the start of the 21st century, human activity has generated vast amounts of data, and the development of networks and big data has drawn more and more researchers to the field. Before big data can be studied, the data must first be preprocessed, and data deduplication is the first step of preprocessing. Removing large amounts of repeated data greatly speeds up data queries, reduces storage space, and saves storage costs. A deduplication technique finds and removes the repeated parts of the data, transmits and stores only the deduplicated result, and points stored data objects at the repeated data with pointers, so that duplicates are deleted, at most one copy of an identical document remains, and storage space is saved. In general, the signature value of a document can be computed with a hash function, but an ordinary hash function suffers from collisions: different documents may produce the same hash signature. Addressing these problems of existing deduplication algorithms, the invention introduces a TF-IDF feature-weight calculation based on part of speech and word length, improving the precision of the final Simhash signature value. The signature retrieval process is then improved so that signature values are distributed uniformly, raising retrieval efficiency; a brand-new signature formula is proposed to compute the final signature of an article and compare document similarity. This provides technical support for big data text deduplication.
Disclosure of Invention
The invention discloses a big data text deduplication technology based on an improved Simhash algorithm, aiming at solving the problems of redundancy, repetition and the like of a large amount of data generated in the current big data era.
Therefore, the invention provides the following technical scheme:
1. a big data text duplicate removal technology based on an improved Simhash algorithm mainly comprises the following steps:
(1) performing word segmentation by adopting a word segmentation tool;
(2) assigning corresponding weights to the segmented keywords;
(3) calculating a document content signature and an article abstract signature through the keyword weight;
(4) computing and finding similar documents.
2. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (1), the NLPIR-ICTCLAS word segmentation system, developed over more than a decade by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Built on the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter, its kernel has been upgraded more than 10 times and it has over 300,000 users. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction, with greatly improved segmentation speed and precision.
3. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (2), the TF-IDF algorithm is improved so that the inverse document frequency is no longer the sole criterion for keyword weight; criteria based on the part of speech and the length of the keyword are introduced so that the more important keywords receive higher weights after segmentation.
4. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (3), an article summary is generated from the keyword frequencies and the first sentence of each paragraph, and the signature values of the summary and of the document content are then computed separately. The signature computation follows the traditional Simhash method: the weighted keyword feature values are hashed, the hash results are weighted and reduced in dimension, and finally the signature values of the article summary and the document content are obtained.
5. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (4), a brand-new document signature distance is proposed: the Hamming distances between the content signatures and between the summary signatures are compared and combined. The signature comparison process is also optimized: a secondary hash makes the data distribution more uniform, and the final Hamming distance between the two documents is then obtained.
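The signature computation in step (3) follows the classic Simhash pipeline. A minimal sketch is given below, assuming a 64-bit signature and an MD5-based feature hash; the filing does not specify the underlying hash function, so that choice is illustrative:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Classic Simhash: hash each feature, accumulate +/- weight per bit
    position, then reduce each dimension to one bit by its sign."""
    v = [0.0] * bits
    for feature, weight in weighted_features:
        # Stable per-feature hash, truncated to `bits` bits (MD5 is an
        # illustrative choice, not the one specified in the filing).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Dimension reduction: positive accumulated sums become 1-bits.
    sig = 0
    for i in range(bits):
        if v[i] > 0:
            sig |= 1 << i
    return sig
```

The same weighted keyword list then yields the same signature, so near-identical documents differ only in a few bits.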
Improvements in or relating to
1. The invention provides a big data text deduplication technology based on an improved Simhash algorithm, and provides a more efficient and accurate deduplication method for a large amount of repeated data generated in the current big data era.
2. The invention changes the word segmentation technique of the traditional Simhash algorithm and the way word-frequency weights are calculated. It adopts the NLPIR-ICTCLAS word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences, which is faster and more accurate for Chinese word segmentation, and introduces the part of speech and length of keywords as additional parameters when computing keyword weights, so that the signature value is calculated more accurately.
3. The invention provides a new improved method for matching Simhash signature values: judge whether the signature values are uniformly distributed across buckets, apply a second hash to the signature data of non-uniform buckets, and compare Hamming distances only among the data within a bucket. This increases the algorithm's space usage, but it reduces the number of hash-value comparisons and improves efficiency.
4. The invention provides a brand-new document signature calculation based on the concept of an article summary, which is generated from the keyword frequencies and the first sentence of each paragraph. The improved TF-IDF algorithm then computes the topic-word weights of the document content and the article summary separately. Each is fed into the improved Simhash signature computation to obtain a content signature and a summary signature; the Hamming distance between content signatures and the Hamming distance between summary signatures are combined to obtain the final Hamming distance between the two documents, making the deduplication result more accurate.
Drawings
Fig. 1 is a flowchart of a big data text deduplication technology based on an improved Simhash algorithm in an embodiment of the present invention.
FIG. 2 is a diagram showing the effect of the fingerprint weight on the algorithm's performance.
FIG. 3 is a graph of accuracy versus results in an embodiment of the present invention.
FIG. 4 is a chart showing the comparison of recall ratios according to the embodiment of the present invention.
FIG. 5 is a graph showing the comparison result of the execution times in the embodiment of the present invention.
The specific implementation mode is as follows:
in order to clearly and completely describe the technical solutions in the embodiments of the present invention, the present invention is further described in detail below with reference to the drawings in the embodiments.
The flow of the big data text deduplication technique based on the improved Simhash algorithm according to an embodiment of the invention comprises the following steps, as shown in FIG. 1.
The process of step 1, obtaining the repeated-text data set, is as follows:
From the Sogou news data set (https://www.sogou.com/labs/resource/ca.php), 5000 Chinese news texts are taken in ten categories: 'automobile', 'finance', 'science', 'health', 'sports', 'education', 'culture', 'military', 'entertainment', 'fashion', with 500 similar items per category, mixed with 2000 unrelated items.
Step 2, segmenting the text set and computing the feature weights, proceeds as follows:
Step 2-1: the NLPIR-ICTCLAS word segmentation system, developed over more than ten years by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction. It is among the best-known and most widely used Chinese word segmentation tools. The system also provides efficient keyword extraction and can automatically filter out infrequent words, greatly improving the speed and precision of preprocessing massive data;
Step 2-2: the TF-IDF algorithm computes keyword weights only from occurrence frequency. Although the entries produced by the segmentation tool already exclude most useless words, the inverse document frequency treats frequently occurring keywords as unimportant and rarely occurring keywords as important.
Therefore, the TF-IDF algorithm is improved: the inverse document frequency is no longer the sole criterion for keyword weight, and criteria based on the part of speech and the length of the keyword are introduced, so that the more important keywords receive higher weights after segmentation. The first improvement concerns part of speech, i.e. the grammatical role of a keyword. In a sentence, the subject is necessarily the most important element, since the sentence expands around it, and the subject is usually expressed by a noun. The predicate expresses the state or action of the subject, is second in importance, and is generally expressed by a verb. Objects generally explain and complement the subject and include nouns and adjectives. Other structures exist, but none are important parts of a sentence. Nouns and verbs can therefore be given higher weight. For example, the sentence "method of algorithm optimization" yields the tokens "algorithm", "optimization", "of", and "method" after word segmentation. By occurrence frequency each appears once, but "algorithm" is certainly more important in this context and should be given a higher weight to characterize it. Such keyword weights make the Simhash signature value more accurate. The part-of-speech weights in Table 1 follow from this reasoning.
TABLE 1
Part of speech    Weight
Noun              3
Verb              2
Adjective         1
Other             0
The second improvement concerns term length. According to a 2017 study of keyword length in the publishing industry, based on keywords from six CSSCI academic journals, keywords of 3 to 5 characters are the most common; therefore, keywords of three or more characters are given higher weight after segmentation. The keyword weight formula becomes:
W = tf_ij · idf_i + λ(wn_i + Len(w_i))
where λ is a parameter and Len(w_i) depends on the length of the keyword; its piecewise definition is given in the original filing as an image (Figure BDA0003206035600000041).
the improved keyword calculation formula is used for accurately calculating the feature weight of the keyword, so that the calculated Simhash signature value can be more accurately calculated.
Step 3, the process of the Simhash signature value retrieval stage is as follows:
the invention provides a new improved method for matching Simhash signature values, which is to judge whether the distribution of the signature values in each barrel is uniform or not and carry out the second hash operation on the hashed signature value data of non-uniform barrels, and comprises the following specific steps:
(1) Input any document set, compute its signatures, and check whether they are distributed uniformly across the buckets; for non-uniformly filled buckets, remove the block used in the first comparison from the original signature value. Non-uniformity is judged as follows: if the amount of data in a bucket exceeds (1 + weight) times the average number of elements per bucket, the bucket is considered non-uniform.
(2) For the remaining, uniform buckets, compute the Hamming distances and output pairs with a Hamming distance below 3 as the similarity result for that part of the document set.
(3) For the non-uniform buckets, remove the block already used in the first comparison, partition the remaining signature bits into blocks, and hash these remaining blocks into buckets again with the same hash function. The re-hashed signature blocks are mapped into buckets a second time, similarity is compared within each bucket, and the Hamming distance is computed.
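The bucket-matching steps above can be sketched as follows. The block width, the `skew` parameter (standing in for the "weight" in the non-uniformity test), and the use of only the first two blocks are simplifying assumptions, not the exact scheme of the filing:

```python
from collections import defaultdict

def hamming(a, b):
    """Hamming distance between two integer signatures."""
    return bin(a ^ b).count("1")

def block(sig, i, bits=64, nblocks=4):
    """Extract the i-th block of a signature (pigeonhole blocking)."""
    w = bits // nblocks
    return (sig >> (i * w)) & ((1 << w) - 1)

def find_similar(signatures, max_dist=3, skew=0.5):
    """Bucket signatures by their first block; buckets holding more than
    (1 + skew) * average are re-bucketed on the second block (secondary
    hash) before the pairwise Hamming comparison."""
    buckets = defaultdict(list)
    for sig in signatures:
        buckets[block(sig, 0)].append(sig)
    avg = len(signatures) / max(len(buckets), 1)
    pairs = set()
    for members in buckets.values():
        groups = [members]
        if len(members) > (1 + skew) * avg:   # non-uniform bucket
            sub = defaultdict(list)
            for sig in members:
                sub[block(sig, 1)].append(sig)
            groups = list(sub.values())
        for group in groups:
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    if hamming(group[i], group[j]) < max_dist:
                        pairs.add((group[i], group[j]))
    return pairs
```

Only signatures that collide on a block are compared pairwise, which is what reduces the number of Hamming comparisons at the cost of extra bucket storage.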
Step 4, the improved text distance calculation method comprises the following processes:
the traditional computation of the signature value of the Simhash document is to compare the Hamming distance of the contents of the two documents, and in order to ensure that the complexity of the text deduplication time is not influenced too much and improve the precision of the Simhash algorithm, a brand-new computation method of the signature value of the document is provided according to the concept of an article abstract, and the article abstract is generated through the word frequency of keywords and the first sentence of a text segment. And then, the improved TF-IDF algorithm is used for respectively calculating the subject term weight of the document content and the article abstract. Respectively inputting improved Simhash signature value calculation methods to calculate signature values belonging to document contents and article summaries, then comparing the Hamming distance between the signature value of the document contents and the signature value of the article summaries, and finally obtaining the final Hamming distance between the two documents to obtain a repeated document set.
The calculation formula is as follows:
f(A,B) = μ·CHamming(A,B) + (1 − μ)·SHamming(A,B)
where f(A,B) is the signature distance between documents A and B, CHamming(A,B) is the Hamming distance between their content signatures, SHamming(A,B) is the Hamming distance between their summary signatures, and μ weights the two distances.
The value of μ must be determined experimentally. Therefore, 1100 Sohu news items were used, 100 of which were similar, with μ varied from 0 to 1 in steps of 0.01 as the abscissa. Since the Simhash algorithm considers articles with a Hamming distance below 3 to be duplicates, the text-distance threshold is empirically set to 3 in line with the Simhash signature values. The resulting accuracy (Precision) and recall (Recall) of the algorithm are shown in FIG. 2: when μ is 0.7, i.e. (1 − μ) is 0.3, relatively high accuracy and recall are obtained. Substituting μ into the distance formula gives the Hamming distance f(A,B) of the two documents as:
f(A,B) = 0.7·CHamming(A,B) + 0.3·SHamming(A,B)
when f (A, B) is less than 3, the two documents are considered to be repeated.
The experimental results and analysis of step 5 are as follows:
the accuracy and recall rate of the classical evaluation data deduplication method are selected and defined as follows:
Precision = (number of duplicate documents correctly detected) / (total number of documents reported as duplicates)
Recall = (number of duplicate documents correctly detected) / (number of duplicate documents actually present)
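These two measures can be computed from the sets of detected and actual duplicates; the function name is illustrative:

```python
def precision_recall(found, actual):
    """Precision = |found & actual| / |found|; Recall = |found & actual| / |actual|."""
    found, actual = set(found), set(actual)
    tp = len(found & actual)  # true positives: duplicates correctly detected
    precision = tp / len(found) if found else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```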
the experimental results are shown in fig. 3 and fig. 4, and the accuracy and the recall ratio obtained by observing the experiment are found to be improved by about 5% compared with the classical Simhash algorithm, because the improved Simhash algorithm not only calculates the document content, but also calculates the TF-IDF algorithm from the word segmentation tool and the characteristic weight, optimizes the signature value in multiple aspects, and finally judges whether the articles are similar or not by combining the article abstract. Therefore, the obtained accuracy and recall rate are high and stable.
The experiments finally compare the running times of the original and improved Simhash algorithms. As FIG. 5 shows, measured in seconds as the number of documents increases, the improved Simhash executes faster: the signature matching stage is improved so that, instead of wasting running time on non-uniform bucket-sorted data, a second hash is applied, which reduces the number of signature comparisons and makes the improved algorithm more efficient.
According to the embodiment of the invention, a big data text deduplication technology based on an improved Simhash algorithm can perform decision support for big data deduplication and data preprocessing.
The above is a detailed description of embodiments of the present invention with reference to the accompanying drawings, provided to facilitate understanding of the invention.

Claims (5)

1. A big data text duplicate removal technology based on an improved Simhash algorithm mainly comprises the following steps:
(1) performing word segmentation by adopting a word segmentation tool;
(2) assigning corresponding weights to the segmented keywords;
(3) calculating a document content signature and an article abstract signature through the keyword weight;
(4) computing and finding similar documents.
2. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (1), the NLPIR-ICTCLAS word segmentation system, developed over more than a decade by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Built on the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter, its kernel has been upgraded more than 10 times and it has over 300,000 users. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction, with greatly improved segmentation speed and precision.
3. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (2), the TF-IDF algorithm is improved so that the inverse document frequency is no longer the sole criterion for keyword weight; criteria based on the part of speech and the length of the keyword are introduced so that the more important keywords receive higher weights after segmentation.
4. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (3), an article summary is generated from the keyword frequencies and the first sentence of each paragraph, and the signature values of the summary and of the document content are then computed separately. The signature computation follows the traditional Simhash method: the weighted keyword feature values are hashed, the hash results are weighted and reduced in dimension, and finally the signature values of the article summary and the document content are obtained.
5. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (4), a brand-new document signature distance is proposed: the Hamming distances between the content signatures and between the summary signatures are compared and combined. The signature comparison process is also optimized: a secondary hash makes the data distribution more uniform, and the final Hamming distance between the two documents is then obtained.
CN202110917142.4A 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm Pending CN113609247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110917142.4A CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110917142.4A CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Publications (1)

Publication Number Publication Date
CN113609247A 2021-11-05

Family

ID=78340199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110917142.4A Pending CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Country Status (1)

Country Link
CN (1) CN113609247A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination