CN113609247A - Big data text duplicate removal technology based on improved Simhash algorithm - Google Patents


Info

Publication number
CN113609247A
CN113609247A
Authority
CN
China
Prior art keywords
signature
word segmentation
big data
algorithm
data text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110917142.4A
Other languages
Chinese (zh)
Inventor
梁超
张宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology filed Critical Harbin University of Science and Technology
Priority to CN202110917142.4A priority Critical patent/CN113609247A/en
Publication of CN113609247A publication Critical patent/CN113609247A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/325Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The invention discloses a big data text deduplication technique based on an improved Simhash algorithm, relating to the field of natural language processing and comprising the following steps: (1) performing word segmentation with a word segmentation tool; (2) assigning corresponding weights to the segmented keywords; (3) calculating a document content signature and an article summary signature from the keyword weights; (4) computing and finding similar documents. Building on the classic Simhash algorithm, the invention provides an improved Simhash algorithm for big data text deduplication. First, a better word segmentation tool is selected so that segmentation is more accurate; part of speech and word length are considered in the weight calculation stage; and a secondary hash based on the idea of bucket sorting is applied in the signature matching stage. Finally, a brand-new formula is provided for comparing the Hamming distances of Simhash signatures computed from the feature vectors of the article content and the summary content. The method is well suited to big data text deduplication, improving accuracy and recall as well as deduplication speed.

Description

Big data text duplicate removal technology based on improved Simhash algorithm
Technical Field
The invention discloses a big data text duplicate removal technology based on an improved Simhash algorithm, and relates to the field of natural language processing.
Background
Since the start of the 21st century, human activity has generated vast amounts of data, and the development of networks and big data has drawn more and more researchers to the field. Before big data can be studied, the data must first be preprocessed, and data deduplication is the first step of preprocessing. Removing large amounts of repeated data greatly speeds up data queries, reduces storage space, and saves storage costs. A deduplication technique finds and removes the repeated parts of the data, transmits and stores only the deduplicated result, and points stored data objects at the repeated data with pointers, so that duplicates are deleted, at most one copy of an identical document remains, and storage space is saved. In general, the signature value of a document can be computed with a hash function, but an ordinary hash function suffers from collisions: different documents may produce the same hash signature. Addressing these problems of existing deduplication algorithms, the invention introduces a TF-IDF feature-weight calculation based on part of speech and word length, improving the precision of the final Simhash signature value. The signature retrieval process is then improved so that signature values are distributed uniformly, raising retrieval efficiency; a brand-new signature formula is proposed to compute the final signature of an article and compare document similarity. This provides technical support for big data text deduplication.
Disclosure of Invention
The invention discloses a big data text deduplication technology based on an improved Simhash algorithm, aiming at solving the problems of redundancy, repetition and the like of a large amount of data generated in the current big data era.
Therefore, the invention provides the following technical scheme:
1. a big data text duplicate removal technology based on an improved Simhash algorithm mainly comprises the following steps:
(1) performing word segmentation by adopting a word segmentation tool;
(2) assigning corresponding weights to the segmented keywords;
(3) calculating a document content signature and an article abstract signature through the keyword weight;
(4) computing and finding similar documents.
2. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (1), the NLPIR-ICTCLAS word segmentation system, developed over more than a decade by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Built on the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter, its kernel has been upgraded more than 10 times and it has over 300,000 users. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction, with greatly improved segmentation speed and precision.
3. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (2), the TF-IDF algorithm is improved so that the inverse document frequency is no longer the sole criterion for keyword weight; criteria based on the part of speech and the length of the keyword are introduced so that the more important keywords receive higher weights after segmentation.
4. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (3), an article summary is generated from the keyword frequencies and the first sentence of each paragraph, and the signature values of the summary and of the document content are then computed separately. The signature computation follows the traditional Simhash method: the weighted keyword feature values are hashed, the hash results are weighted and reduced in dimension, and finally the signature values of the article summary and the document content are obtained.
5. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (4), a brand-new document signature distance is proposed: the Hamming distances between the content signatures and between the summary signatures are compared and combined. The signature comparison process is also optimized: a secondary hash makes the data distribution more uniform, and the final Hamming distance between the two documents is then obtained.
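The signature computation in step (3) follows the classic Simhash pipeline. A minimal sketch is given below, assuming a 64-bit signature and an MD5-based feature hash; the filing does not specify the underlying hash function, so that choice is illustrative:

```python
import hashlib

def simhash(weighted_features, bits=64):
    """Classic Simhash: hash each feature, accumulate +/- weight per bit
    position, then reduce each dimension to one bit by its sign."""
    v = [0.0] * bits
    for feature, weight in weighted_features:
        # Stable per-feature hash, truncated to `bits` bits (MD5 is an
        # illustrative choice, not the one specified in the filing).
        h = int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            if (h >> i) & 1:
                v[i] += weight
            else:
                v[i] -= weight
    # Dimension reduction: positive accumulated sums become 1-bits.
    sig = 0
    for i in range(bits):
        if v[i] > 0:
            sig |= 1 << i
    return sig
```

The same weighted keyword list then yields the same signature, so near-identical documents differ only in a few bits.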
Improvements in or relating to
1. The invention provides a big data text deduplication technology based on an improved Simhash algorithm, and provides a more efficient and accurate deduplication method for a large amount of repeated data generated in the current big data era.
2. The invention changes the word segmentation technique of the traditional Simhash algorithm and the way word-frequency weights are calculated. It adopts the NLPIR-ICTCLAS word segmentation system developed by the Institute of Computing Technology, Chinese Academy of Sciences, which is faster and more accurate for Chinese word segmentation, and introduces the part of speech and length of keywords as additional parameters when computing keyword weights, so that the signature value is calculated more accurately.
3. The invention provides a new improved method for matching Simhash signature values: judge whether the signature values are uniformly distributed across buckets, apply a second hash to the signature data of non-uniform buckets, and compare Hamming distances only among the data within a bucket. This increases the algorithm's space usage, but it reduces the number of hash-value comparisons and improves efficiency.
4. The invention provides a brand-new document signature calculation based on the concept of an article summary, which is generated from the keyword frequencies and the first sentence of each paragraph. The improved TF-IDF algorithm then computes the topic-word weights of the document content and the article summary separately. Each is fed into the improved Simhash signature computation to obtain a content signature and a summary signature; the Hamming distance between content signatures and the Hamming distance between summary signatures are combined to obtain the final Hamming distance between the two documents, making the deduplication result more accurate.
Drawings
Fig. 1 is a flowchart of a big data text deduplication technology based on an improved Simhash algorithm in an embodiment of the present invention.
FIG. 2 is a diagram showing the effect of the fingerprint weight on the algorithm's performance.
FIG. 3 is a graph of accuracy versus results in an embodiment of the present invention.
FIG. 4 is a chart showing the comparison of recall ratios according to the embodiment of the present invention.
FIG. 5 is a graph showing the comparison result of the execution times in the embodiment of the present invention.
The specific implementation mode is as follows:
in order to clearly and completely describe the technical solutions in the embodiments of the present invention, the present invention is further described in detail below with reference to the drawings in the embodiments.
The flow of the big data text deduplication technique based on the improved Simhash algorithm according to an embodiment of the invention comprises the following steps, as shown in FIG. 1.
The process of step 1, obtaining the repeated-text data set, is as follows:
From the Sogou news data set (https://www.sogou.com/labs/resource/ca.php), 5000 Chinese news texts are taken in ten categories: 'automobile', 'finance', 'science', 'health', 'sports', 'education', 'culture', 'military', 'entertainment', 'fashion', with 500 similar items per category, mixed with 2000 unrelated items.
Step 2, segmenting the text set and computing the feature weights, proceeds as follows:
Step 2-1: the NLPIR-ICTCLAS word segmentation system, developed over more than ten years by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction. It is among the best-known and most widely used Chinese word segmentation tools. The system also provides efficient keyword extraction and can automatically filter out infrequent words, greatly improving the speed and precision of preprocessing massive data;
Step 2-2: the TF-IDF algorithm computes keyword weights only from occurrence frequency. Although the entries produced by the segmentation tool already exclude most useless words, the inverse document frequency treats frequently occurring keywords as unimportant and rarely occurring keywords as important.
Therefore, the TF-IDF algorithm is improved: the inverse document frequency is no longer the sole criterion for keyword weight, and criteria based on the part of speech and the length of the keyword are introduced, so that the more important keywords receive higher weights after segmentation. The first improvement concerns part of speech, i.e. the grammatical role of a keyword. In a sentence, the subject is necessarily the most important element, since the sentence expands around it, and the subject is usually expressed by a noun. The predicate expresses the state or action of the subject, is second in importance, and is generally expressed by a verb. Objects generally explain and complement the subject and include nouns and adjectives. Other structures exist, but none are important parts of a sentence. Nouns and verbs can therefore be given higher weight. For example, the sentence "method of algorithm optimization" yields the tokens "algorithm", "optimization", "of", and "method" after word segmentation. By occurrence frequency each appears once, but "algorithm" is certainly more important in this context and should be given a higher weight to characterize it. Such keyword weights make the Simhash signature value more accurate. The part-of-speech weights in Table 1 follow from this reasoning.
TABLE 1
Part of speech    Weight
Noun              3
Verb              2
Adjective         1
Other             0
The second improvement concerns term length. According to a 2017 study of keyword length in the publishing industry, based on keywords from six CSSCI academic journals, keywords of 3 to 5 characters are the most common; therefore, keywords of three or more characters are given higher weight after segmentation. The keyword weight formula becomes:
W = tf_ij · idf_i + λ(wn_i + Len(w_i))
where λ is a parameter and Len(w_i) depends on the length of the keyword; its piecewise definition is given in the original filing as an image (Figure BDA0003206035600000041).
the improved keyword calculation formula is used for accurately calculating the feature weight of the keyword, so that the calculated Simhash signature value can be more accurately calculated.
Step 3, the process of the Simhash signature value retrieval stage is as follows:
the invention provides a new improved method for matching Simhash signature values, which is to judge whether the distribution of the signature values in each barrel is uniform or not and carry out the second hash operation on the hashed signature value data of non-uniform barrels, and comprises the following specific steps:
(1) Input any document set, compute its signatures, and check whether they are distributed uniformly across the buckets; for non-uniformly filled buckets, remove the block used in the first comparison from the original signature value. Non-uniformity is judged as follows: if the amount of data in a bucket exceeds (1 + weight) times the average number of elements per bucket, the bucket is considered non-uniform.
(2) For the remaining, uniform buckets, compute the Hamming distances and output pairs with a Hamming distance below 3 as the similarity result for that part of the document set.
(3) For the non-uniform buckets, remove the block already used in the first comparison, partition the remaining signature bits into blocks, and hash these remaining blocks into buckets again with the same hash function. The re-hashed signature blocks are mapped into buckets a second time, similarity is compared within each bucket, and the Hamming distance is computed.
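The bucket-matching steps above can be sketched as follows. The block width, the `skew` parameter (standing in for the "weight" in the non-uniformity test), and the use of only the first two blocks are simplifying assumptions, not the exact scheme of the filing:

```python
from collections import defaultdict

def hamming(a, b):
    """Hamming distance between two integer signatures."""
    return bin(a ^ b).count("1")

def block(sig, i, bits=64, nblocks=4):
    """Extract the i-th block of a signature (pigeonhole blocking)."""
    w = bits // nblocks
    return (sig >> (i * w)) & ((1 << w) - 1)

def find_similar(signatures, max_dist=3, skew=0.5):
    """Bucket signatures by their first block; buckets holding more than
    (1 + skew) * average are re-bucketed on the second block (secondary
    hash) before the pairwise Hamming comparison."""
    buckets = defaultdict(list)
    for sig in signatures:
        buckets[block(sig, 0)].append(sig)
    avg = len(signatures) / max(len(buckets), 1)
    pairs = set()
    for members in buckets.values():
        groups = [members]
        if len(members) > (1 + skew) * avg:   # non-uniform bucket
            sub = defaultdict(list)
            for sig in members:
                sub[block(sig, 1)].append(sig)
            groups = list(sub.values())
        for group in groups:
            for i in range(len(group)):
                for j in range(i + 1, len(group)):
                    if hamming(group[i], group[j]) < max_dist:
                        pairs.add((group[i], group[j]))
    return pairs
```

Only signatures that collide on a block are compared pairwise, which is what reduces the number of Hamming comparisons at the cost of extra bucket storage.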
Step 4, the improved text distance calculation method comprises the following processes:
the traditional computation of the signature value of the Simhash document is to compare the Hamming distance of the contents of the two documents, and in order to ensure that the complexity of the text deduplication time is not influenced too much and improve the precision of the Simhash algorithm, a brand-new computation method of the signature value of the document is provided according to the concept of an article abstract, and the article abstract is generated through the word frequency of keywords and the first sentence of a text segment. And then, the improved TF-IDF algorithm is used for respectively calculating the subject term weight of the document content and the article abstract. Respectively inputting improved Simhash signature value calculation methods to calculate signature values belonging to document contents and article summaries, then comparing the Hamming distance between the signature value of the document contents and the signature value of the article summaries, and finally obtaining the final Hamming distance between the two documents to obtain a repeated document set.
The calculation formula is as follows:
f(A,B) = μ·CHamming(A,B) + (1 − μ)·SHamming(A,B)
where f(A,B) is the signature distance between documents A and B, CHamming(A,B) is the Hamming distance between their content signatures, SHamming(A,B) is the Hamming distance between their summary signatures, and μ weights the two distances.
The value of μ must be determined experimentally. Therefore, 1100 Sohu news items were used, 100 of which were similar, with μ varied from 0 to 1 in steps of 0.01 as the abscissa. Since the Simhash algorithm considers articles with a Hamming distance below 3 to be duplicates, the text-distance threshold is empirically set to 3 in line with the Simhash signature values. The resulting accuracy (Precision) and recall (Recall) of the algorithm are shown in FIG. 2: when μ is 0.7, i.e. (1 − μ) is 0.3, relatively high accuracy and recall are obtained. Substituting μ into the distance formula gives the Hamming distance f(A,B) of the two documents as:
f(A,B) = 0.7·CHamming(A,B) + 0.3·SHamming(A,B)
when f (A, B) is less than 3, the two documents are considered to be repeated.
The experimental results and analysis of step 5 are as follows:
the accuracy and recall rate of the classical evaluation data deduplication method are selected and defined as follows:
Precision = (number of duplicate documents correctly detected) / (total number of documents reported as duplicates)
Recall = (number of duplicate documents correctly detected) / (number of duplicate documents actually present)
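These two measures can be computed from the sets of detected and actual duplicates; the function name is illustrative:

```python
def precision_recall(found, actual):
    """Precision = |found & actual| / |found|; Recall = |found & actual| / |actual|."""
    found, actual = set(found), set(actual)
    tp = len(found & actual)  # true positives: duplicates correctly detected
    precision = tp / len(found) if found else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```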
the experimental results are shown in fig. 3 and fig. 4, and the accuracy and the recall ratio obtained by observing the experiment are found to be improved by about 5% compared with the classical Simhash algorithm, because the improved Simhash algorithm not only calculates the document content, but also calculates the TF-IDF algorithm from the word segmentation tool and the characteristic weight, optimizes the signature value in multiple aspects, and finally judges whether the articles are similar or not by combining the article abstract. Therefore, the obtained accuracy and recall rate are high and stable.
The experiments finally compare the running times of the original and improved Simhash algorithms. As FIG. 5 shows, measured in seconds as the number of documents increases, the improved Simhash executes faster: the signature matching stage is improved so that, instead of wasting running time on non-uniform bucket-sorted data, a second hash is applied, which reduces the number of signature comparisons and makes the improved algorithm more efficient.
According to the embodiment of the invention, a big data text deduplication technology based on an improved Simhash algorithm can perform decision support for big data deduplication and data preprocessing.
The above is a detailed description of embodiments of the present invention with reference to the accompanying drawings, provided to facilitate understanding of the invention.

Claims (5)

1. A big data text duplicate removal technology based on an improved Simhash algorithm mainly comprises the following steps:
(1) performing word segmentation by adopting a word segmentation tool;
(2) assigning corresponding weights to the segmented keywords;
(3) calculating a document content signature and an article abstract signature through the keyword weight;
(4) computing and finding similar documents.
2. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (1), the NLPIR-ICTCLAS word segmentation system, developed over more than a decade by the Institute of Computing Technology, Chinese Academy of Sciences, is selected. Built on the ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) segmenter, its kernel has been upgraded more than 10 times and it has over 300,000 users. Its functions are powerful, mainly including Chinese and English word segmentation, part-of-speech tagging, named entity recognition, new word recognition, and keyword extraction, with greatly improved segmentation speed and precision.
3. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (2), the TF-IDF algorithm is improved so that the inverse document frequency is no longer the sole criterion for keyword weight; criteria based on the part of speech and the length of the keyword are introduced so that the more important keywords receive higher weights after segmentation.
4. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (3), an article summary is generated from the keyword frequencies and the first sentence of each paragraph, and the signature values of the summary and of the document content are then computed separately. The signature computation follows the traditional Simhash method: the weighted keyword feature values are hashed, the hash results are weighted and reduced in dimension, and finally the signature values of the article summary and the document content are obtained.
5. The big data text deduplication technology based on the improved Simhash algorithm as claimed in claim 1, wherein: in step (4), a brand-new document signature distance is proposed: the Hamming distances between the content signatures and between the summary signatures are compared and combined. The signature comparison process is also optimized: a secondary hash makes the data distribution more uniform, and the final Hamming distance between the two documents is then obtained.
CN202110917142.4A 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm Pending CN113609247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110917142.4A CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110917142.4A CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Publications (1)

Publication Number Publication Date
CN113609247A 2021-11-05

Family

ID=78340199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110917142.4A Pending CN113609247A (en) 2021-08-11 2021-08-11 Big data text duplicate removal technology based on improved Simhash algorithm

Country Status (1)

Country Link
CN (1) CN113609247A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114201959A (en) * 2021-11-16 2022-03-18 湖南长泰工业科技有限公司 Mobile emergency command method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination