CN110516212A - Cloud computing mass document similarity detection method - Google Patents

Info

Publication number
CN110516212A
Authority
CN
China
Prior art keywords
text
vector
cloud computing
document
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910821968.3A
Other languages
Chinese (zh)
Other versions
CN110516212B (en)
Inventor
王海涛
常春勤
曾艳阳
张霄宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN201910821968.3A
Publication of CN110516212A
Application granted
Publication of CN110516212B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a mass document similarity detection method for cloud computing. A cloud computing environment is built on a distributed file system and a parallel database, and the mass document set to be detected is uploaded to the parallel database; the text-term relation set of the corpus is stored in the parallel relational database as key-value pairs. After preprocessing such as stop-word removal and word segmentation, the feature vector of the text to be detected is obtained by feature extraction and compared with the feature vectors of the corpus in the parallel database to produce a similarity value. The invention is suited to text deduplication on massive data sets, offers high operating efficiency and short running time, and overcomes the inability of traditional similarity detection techniques to handle massive text data sets.

Description

Cloud computing mass document similarity detection method
Technical field
The present invention relates to the field of document similarity comparison, and in particular to a mass document similarity detection method for cloud computing.
Background
With the progress of network technology, most documents on the network can be freely reprinted, propagated, and modified. This makes topic extraction, vectorized representation, feature weight calculation, and similarity detection for documents more difficult. To improve data quality and the efficiency of information dissemination, and to reduce unnecessary resource consumption, an efficient deduplication scheme that can handle massive numbers of documents is needed.
To address deduplication of massive document sets, locality-sensitive hashing has been proposed. Its goal is to use a well-chosen hash function to distribute document features as uniformly as possible, so that nearly identical content produces similar or identical hash values; the similarity of document content can then be judged from the similarity of the hash values.
Another commonly used deduplication algorithm is MinHash. After segmenting the documents into words, it stores them as a matrix and applies several random hash functions to the rows (or columns) of the matrix, taking the minimum hash result of each row to represent that row's features. A string of minimum hash values thus replaces the whole matrix, which reduces its dimensionality. MinHash is widely used and fast, but it usually needs many hash functions to be accurate enough, and computing them is expensive.
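The MinHash idea described above can be illustrated with a minimal sketch. It is not taken from the patent: the function names, the sample token sets, and the use of a seeded MD5 digest to stand in for "several random hash functions" are all assumptions made for illustration.

```python
import hashlib


def token_hash(token: str, seed: int) -> int:
    """Hash a token under a given seed; each seed plays the role of one hash function (assumption)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens: set[str], num_hashes: int = 64) -> list[int]:
    """For every seeded hash function, keep the minimum hash value over the token set."""
    return [min(token_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching signature positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical, already segmented documents.
doc_a = {"cloud", "computing", "document", "similarity", "detection"}
doc_b = {"cloud", "computing", "document", "duplicate", "detection"}

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
print(estimated_jaccard(sig_a, sig_b))
```

The cost noted in the text shows up directly: every token is hashed once per hash function, so the work grows with `num_hashes`.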
The ICTCLAS word segmenter and the TF-IDF algorithm can also be used to generate hash values for Chinese documents, and comparing the Hamming distance between those values determines whether two documents are similar. Some researchers have proposed a scheme that combines a Bloom filter, a trie, and the SimHash algorithm. It works in two stages: exact deduplication with the Bloom filter and trie, followed by near-duplicate removal with SimHash. The main problem with these methods is that document features are easily lost during mapping. A deduplication scheme for massive Chinese document sets is therefore urgently needed.
Summary of the invention
The object of the present invention is to provide a mass document similarity detection method for cloud computing that solves the above problems of the prior art and reduces cost without losing document features.
To achieve this object, the present invention provides a mass document similarity detection method for cloud computing, comprising the following steps:
Step 1: build a cloud computing environment from a distributed file system and a parallel database, then upload the document set to be detected into the cloud computing environment;
Step 2: preprocess the document set to be detected by removing stop words and segmenting words, and convert text files of different formats into text files of a uniform format;
Step 3: transform the text from Step 2 into an n-dimensional word frequency vector, that is, extract a word frequency vector from the text, then generate a vector fingerprint with the SimHash algorithm. The fingerprint is 64 bits long. Once obtained, the fingerprint is stored in a sequence file as a key-value pair, where the file name is the key and the 64-bit vector fingerprint is the value;
Step 4: weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them; the file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between that vector and the document set.
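Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `token_hash64` (MD5 truncated to 64 bits), raw term frequency as the feature weight, the sample file names, and the in-memory dictionary standing in for the key-value file are all hypothetical choices.

```python
import hashlib
from collections import Counter

FINGERPRINT_BITS = 64


def token_hash64(token: str) -> int:
    """Map a token to a 64-bit integer (assumption: first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash64(tokens: list[str]) -> int:
    """Weighted SimHash: add +weight for a 1 bit, -weight for a 0 bit, then keep the sign."""
    weights = Counter(tokens)  # term frequency used as the feature weight (assumption)
    totals = [0] * FINGERPRINT_BITS
    for token, weight in weights.items():
        h = token_hash64(token)
        for bit in range(FINGERPRINT_BITS):
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total >= 0:  # non-negative totals map to 1, negative totals to 0
            fingerprint |= 1 << bit
    return fingerprint


# Hypothetical preprocessed corpus: file name -> token list after Step 2.
corpus = {
    "doc_a.txt": ["cloud", "computing", "document", "similarity", "cloud"],
    "doc_b.txt": ["cloud", "computing", "document", "duplicate", "cloud"],
}

# Key-value pairs as in Step 3: file name is the key, 64-bit fingerprint is the value.
fingerprints = {name: simhash64(tokens) for name, tokens in corpus.items()}
for name, fp in fingerprints.items():
    print(name, format(fp, "064b"))
```

A real deployment would write these pairs to the distributed store built in Step 1 rather than hold them in memory.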
Preferably, a duplication threshold is defined in advance; when the similarity of two records is greater than or equal to the threshold, they are considered duplicate records. The similarity is calculated as follows:
where v_i denotes a record in the part shared by record A and record B, W(v_i) denotes the quantity of v_i, v_j denotes a record in the merged set of all records making up record A and record B, and W(v_j) denotes the quantity of v_j.
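The similarity formula itself is not reproduced in this text. One form consistent with the definitions given here, offered as an assumption rather than as the patent's exact expression, is the ratio of the total quantity of shared records to the total quantity of records in the merged set:

```latex
\mathrm{Sim}(A, B) = \frac{\sum_{v_i \in A \cap B} W(v_i)}{\sum_{v_j \in A \cup B} W(v_j)}
```

Under this reading the value lies between 0 and 1, and records A and B are treated as duplicates when Sim(A, B) is greater than or equal to the predefined threshold.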
The invention discloses the following technical effects: the document set to be detected is preprocessed by stop-word removal and word segmentation and converted into n-dimensional word vectors, which makes the method suitable for text deduplication on massive data sets. The similarity of the file under test is judged by the angle between the text's feature vector and the document set. The method has high operating efficiency and short running time, and overcomes the inability of traditional similarity detection techniques to handle massive text data sets.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the mass text similarity detection of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
To make the above objects, features, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention provides a mass document similarity detection method for cloud computing, comprising the following steps:
Step 1: build a cloud computing environment from a distributed file system and a parallel database, then upload the document set to be detected into the cloud computing environment;
Step 2: preprocess the document set to be detected by removing stop words and segmenting words, and convert text files of different formats into text files of a uniform format;
Step 3: transform the text from Step 2 into an n-dimensional word frequency vector, that is, extract a word frequency vector from the text, then generate a vector fingerprint with the SimHash algorithm. Fingerprint generation is an important prerequisite for similarity detection. Taking the generation of a fingerprint sequence as an example, if the binary number 0101 represents a four-bit hash signature of a feature, the four-dimensional vector produced by that feature is (-1, 1, -1, 1)^T: where a bit of the hash signature is 0, the corresponding component of the mapped vector is -1, and where it is 1, the component is 1. All feature vectors in a document are then weighted and added, with the feature weights serving as the weighting coefficients.
Suppose all documents are represented by five features d_1, ..., d_5, whose corresponding 3-dimensional vectors are v(d_1) = (1, -1, 1)^T, v(d_2) = (-1, 1, 1)^T, v(d_3) = (1, -1, -1)^T, v(d_4) = (-1, -1, 1)^T, and v(d_5) = (1, 1, -1)^T, and that a 3-bit signature of an arbitrary document is wanted. For a document D = (d_1 = 1, d_2 = 2, d_3 = 0, d_4 = 3, d_5 = 0)^T, the SimHash principle gives the following calculation:
d_1*v(d_1) + d_2*v(d_2) + d_3*v(d_3) + d_4*v(d_4) + d_5*v(d_5) = (-4, -2, 6)^T. According to the SimHash principle, if an element of the vector is less than 0, the corresponding bit of the signature is 0; otherwise it is 1. The signature is therefore m = 001. The fingerprint is 64 bits long. Once obtained, it is stored in a sequence file as a key-value pair, where the file name is the key and the 64-bit vector fingerprint is the value;
Step 4: weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them; the file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between that vector and the document set. A duplication threshold is defined in advance; when the similarity of two records is greater than or equal to the threshold, they are considered duplicate records. The similarity is calculated as follows:
where v_i denotes a record in the part shared by record A and record B, W(v_i) denotes the quantity of v_i, v_j denotes a record in the merged set of all records making up record A and record B, and W(v_j) denotes the quantity of v_j.
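The three-dimensional example in Step 3 can be checked with a short sketch. The function names are hypothetical; the feature vectors, the document weights, and the sign rule (negative gives 0, otherwise 1) are the ones stated in the text.

```python
def bits_to_vector(bits: str) -> list[int]:
    """Map a hash signature to a vector: bit 0 becomes -1, bit 1 becomes 1."""
    return [1 if b == "1" else -1 for b in bits]


def weighted_sum(weights: list[int], vectors: list[tuple[int, ...]]) -> list[int]:
    """Add the feature vectors component by component, scaled by their weights."""
    dims = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dims)]


def to_signature(total: list[int]) -> str:
    """An element less than 0 gives bit 0; any other element gives bit 1."""
    return "".join("0" if x < 0 else "1" for x in total)


# v(d_1) ... v(d_5) and the document D = (1, 2, 0, 3, 0) from the text.
feature_vectors = [(1, -1, 1), (-1, 1, 1), (1, -1, -1), (-1, -1, 1), (1, 1, -1)]
document_weights = [1, 2, 0, 3, 0]

total = weighted_sum(document_weights, feature_vectors)
signature = to_signature(total)
print(bits_to_vector("0101"))  # [-1, 1, -1, 1]
print(total, signature)        # [-4, -2, 6] 001
```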
Given the constraints of the environment, the present invention is illustrated with SimHash signature generation:
Assume document A reads: "Employees of 'Area 51' in the United States say there are 9 flying saucers inside and that they have seen gray aliens." The method proceeds as follows:
Step 1, word segmentation: segment the text of document A into feature words, producing a word sequence with noise words removed, and assign a weight to each word; assume weights fall into 5 levels. After segmentation the result is "United States (4), Area 51 (5), employees (3), say (1), inside (2), have (1), 9 (3), flying saucers (5), once (1), seen (3), gray (4), aliens (5)". The number in parentheses is the word's importance within the whole sentence; the larger the number, the more important the word.
Step 2, hashing: transform the text into an n-dimensional word frequency vector, that is, extract a word frequency vector from the text. Because fingerprint generation is an important prerequisite for similarity detection, a hash algorithm is used to generate the vector fingerprint: each word is converted to a hash value. For example, "United States" hashes to 100101 and "Area 51" hashes to 101011. The character strings thus become strings of digits, which digitizes the text.
Step 3, weighting: using the hash results from the previous stage, weight and add all feature vectors in the document under test, forming a weighted digit string for each word according to its weight, where the feature weight is the weighting coefficient. The file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between that vector and the document set. For example, the hash value of "United States" is "100101", which weighting turns into "4 -4 -4 4 -4 4"; the hash value of "Area 51" is "101011", which weighting turns into "5 -5 5 -5 5 5".
Step 4, merging: add up the sequences calculated for the individual words to obtain a single sequence. For example, adding "4 -4 -4 4 -4 4" for "United States" and "5 -5 5 -5 5 5" for "Area 51" position by position gives "9 -9 1 -1 1 9". Only two words are used here as an example; a real calculation adds the sequences of all words.
Step 5, dimensionality reduction: convert the "9 -9 1 -1 1 9" calculated in Step 4 into a string of 0s and 1s to form the final SimHash signature. Each position greater than 0 is recorded as 1 and each position less than 0 as 0. The final result is "101011".
Step 6: with the SimHash signatures generated by the previous stage, the similarity between two texts can be measured by counting the positions at which the two signatures differ.
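The six steps above can be traced with a minimal sketch using the two words from the example. The function names are hypothetical, and the 6-bit hash values are simply the ones quoted in the text rather than the output of a real hash function.

```python
def weighted_bits(hash_bits: str, weight: int) -> list[int]:
    """Step 3: a 1 bit contributes +weight, a 0 bit contributes -weight."""
    return [weight if b == "1" else -weight for b in hash_bits]


def merge(weighted_words: list[tuple[str, int]]) -> list[int]:
    """Step 4: add the weighted sequences of all words position by position."""
    totals = [0] * len(weighted_words[0][0])
    for hash_bits, weight in weighted_words:
        for k, value in enumerate(weighted_bits(hash_bits, weight)):
            totals[k] += value
    return totals


def reduce_to_bits(totals: list[int]) -> str:
    """Step 5: positions greater than 0 become 1, the rest become 0."""
    return "".join("1" if t > 0 else "0" for t in totals)


def hamming_distance(sig_a: str, sig_b: str) -> int:
    """Step 6: count the positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# (hash value, weight) for "United States" and "Area 51", as given in the text.
words = [("100101", 4), ("101011", 5)]

totals = merge(words)
signature = reduce_to_bits(totals)
print(totals)     # [9, -9, 1, -1, 1, 9]
print(signature)  # 101011
print(hamming_distance(signature, "100101"))  # 3
```

A smaller Hamming distance means more similar texts; the cut-off below which two documents count as near-duplicates is a tuning choice that the text does not specify.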
In this application, the document set to be detected is preprocessed by stop-word removal and word segmentation and converted into n-dimensional word vectors, which makes the method suitable for text deduplication on massive data sets. The similarity of the file under test is judged by the angle between the text's feature vector and the document set. The method has high operating efficiency and short running time, and overcomes the inability of traditional similarity detection techniques to handle massive text data sets.
In the description of the present invention, it should be understood that terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" indicate orientations or positional relationships based on those shown in the drawings. They are used only for convenience in describing the present invention and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the invention.
The embodiments described above only describe preferred implementations of the present invention and do not limit its scope. Without departing from the spirit of the design of the present invention, all changes and improvements made to the technical solution of the present invention by a person of ordinary skill in the art fall within the protection scope determined by the claims of the present invention.

Claims (2)

1. A mass document similarity detection method for cloud computing, characterized in that it comprises the following steps:
Step 1: build a cloud computing environment from a distributed file system and a parallel database, then upload the document set to be detected into the cloud computing environment;
Step 2: preprocess the document set to be detected by removing stop words and segmenting words, and convert text files of different formats into text files of a uniform format;
Step 3: transform the text from Step 2 into an n-dimensional word frequency vector, that is, extract a word frequency vector from the text, then generate a vector fingerprint with the SimHash algorithm, the fingerprint being 64 bits long; once obtained, the fingerprint is stored in a sequence file as a key-value pair, where the file name is the key and the 64-bit vector fingerprint is the value;
Step 4: weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them; the file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between that vector and the document set.
2. The mass document similarity detection method for cloud computing according to claim 1, characterized in that a duplication threshold is defined in advance; when the similarity of two records is greater than or equal to the threshold, they are considered duplicate records. The similarity is calculated as follows:
where v_i denotes a record in the part shared by record A and record B, W(v_i) denotes the quantity of v_i, v_j denotes a record in the merged set of all records making up record A and record B, and W(v_j) denotes the quantity of v_j.
CN201910821968.3A 2019-09-02 2019-09-02 Cloud computing mass document similarity detection method Active CN110516212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910821968.3A CN110516212B (en) 2019-09-02 2019-09-02 Cloud computing mass document similarity detection method


Publications (2)

Publication Number Publication Date
CN110516212A true CN110516212A (en) 2019-11-29
CN110516212B CN110516212B (en) 2022-10-28

Family

ID=68629137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910821968.3A Active CN110516212B (en) 2019-09-02 2019-09-02 Cloud computing mass document similarity detection method

Country Status (1)

Country Link
CN (1) CN110516212B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006201926A (en) * 2005-01-19 2006-08-03 Konica Minolta Holdings Inc Similar document retrieval system, similar document retrieval method and program
CN103577418A (en) * 2012-07-24 2014-02-12 北京拓尔思信息技术股份有限公司 Massive document distribution searching duplication removing system and method
CN108132929A (en) * 2017-12-25 2018-06-08 上海大学 A kind of similarity calculation method of magnanimity non-structured text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
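任民山 (Ren Minshan) et al.: "Research on a massive text similarity detection method based on the Simhash algorithm", 《计量与测试技术》 (Metrology & Measurement Technique) *
姜雪 (Jiang Xue) et al.: "Research on a fast similarity detection algorithm for massive text based on semantic fingerprints", 《电脑知识与技术》 (Computer Knowledge and Technology) *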

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111324750A (en) * 2020-02-29 2020-06-23 上海爱数信息技术股份有限公司 Large-scale text similarity calculation and text duplicate checking method
CN111324750B (en) * 2020-02-29 2021-07-13 上海爱数信息技术股份有限公司 Large-scale text similarity calculation and text duplicate checking method
CN111428180A (en) * 2020-03-20 2020-07-17 名创优品(横琴)企业管理有限公司 Webpage duplicate removal method, device and equipment
CN112749131A (en) * 2020-06-11 2021-05-04 腾讯科技(上海)有限公司 Information duplicate elimination processing method and device and computer readable storage medium
CN112529111A (en) * 2020-12-28 2021-03-19 广东国粒教育技术有限公司 Method for calculating class preparation innovation degree of teacher based on ppt document comparison technology
CN113129056A (en) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 Method and system for controlling advertisement putting frequency
CN114386384A (en) * 2021-12-06 2022-04-22 鹏城实验室 Approximate repetition detection method, system and terminal for large-scale long text data
CN114386384B (en) * 2021-12-06 2024-03-19 鹏城实验室 Approximate repetition detection method, system and terminal for large-scale long text data

Also Published As

Publication number Publication date
CN110516212B (en) 2022-10-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant