CN110516212A

CN110516212A - A cloud-computing method for detecting similarity among massive document sets

Info

Publication number: CN110516212A
Application number: CN201910821968.3A
Authority: CN
Inventors: 王海涛; 常春勤; 曾艳阳; 张霄宏
Original assignee: Henan University of Technology
Current assignee: Henan University of Technology
Priority date: 2019-09-02
Filing date: 2019-09-02
Publication date: 2019-11-29
Anticipated expiration: 2039-09-02
Also published as: CN110516212B

Abstract

The present invention discloses a cloud-computing method for detecting similarity among massive document sets. A cloud computing environment is built on a distributed file system and a parallel database, and the massive document set to be detected is uploaded into the parallel database, where the text-term relationship set of the corpus is stored as key-value pairs. After preprocessing such as stop-word removal and word segmentation, feature extraction yields the feature vector of the text to be detected; this vector is then compared against the feature vectors of the corpus in the parallel database to produce a similarity value. The invention is suitable for text deduplication on massive data sets, offers high operating efficiency and short running time, and overcomes the inability of traditional near-duplicate detection techniques to handle massive text data sets.

Description

A cloud-computing method for detecting similarity among massive document sets

Technical field

The present invention relates to the field of document similarity comparison, and more particularly to a cloud-computing method for near-duplicate detection of massive document sets.

Background technique

With the progress of network technology, most documents on the network can be freely reprinted, propagated, and modified. This inadvertently increases the difficulty of topic information extraction, vectorized representation, feature weight calculation, and similarity detection for documents. To improve data quality and the efficiency of information dissemination, and to reduce unnecessary resource consumption, an efficient deduplication scheme capable of handling massive document sets is urgently needed.

To address deduplication of massive document sets, locality-sensitive hashing has been proposed. Its goal is to use an ideal hash function that distributes the features of a whole document as uniformly as possible while mapping nearly identical content to similar or identical hash values, so that the similarity of document content can be judged from the similarity of the hash values.

Another commonly used deduplication algorithm is minhash. After segmenting a document, the algorithm stores it as a matrix, applies multiple random hashes to the rows (or columns) of the matrix, and takes the minimum hash result of each row as the feature of that row. The whole matrix is thus replaced by a string of minimum hash values, which reduces its dimensionality. Minhash is widely used and computes quickly, but it usually needs many hash functions to guarantee sufficient accuracy, and the cost of computing them is high.
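The minhash idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular system: it uses the common set-based formulation (one minimum per hash function), and the seeded BLAKE2b hash family, the function names, and the default of 64 hash functions are all assumptions made for the example.

```python
import hashlib


def _seeded_hash(token, seed):
    # Hypothetical hash family: one 64-bit BLAKE2b hash per integer seed.
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=64):
    # For each hash function, keep the minimum hash over the token set.
    # Assumes at least one token; an empty input raises ValueError.
    token_set = set(tokens)
    return [min(_seeded_hash(t, seed) for t in token_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The cost noted in the text is visible here: every token is hashed once per hash function, so the work grows linearly with `num_hashes`.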

The ICTCLAS segmenter and the TF-IDF algorithm can also be introduced to generate hash values for Chinese documents, and the Hamming distance between those values is compared to determine whether two documents are similar. One proposed scheme combines a Bloom filter, a trie, and the simhash algorithm in two stages: exact duplicates are first removed with the Bloom filter and the trie, and near duplicates are then removed with simhash. The main problem with these methods is that document features are easily lost during mapping. A deduplication scheme for massive Chinese document sets is therefore urgently needed.

Summary of the invention

The object of the present invention is to provide a cloud-computing method for detecting similarity among massive document sets that solves the above problems of the prior art and reduces cost without losing document features.

To achieve the above object, the present invention provides the following scheme: a cloud-computing method for detecting similarity among massive document sets, comprising the following steps:

Step 1: Build a cloud computing environment from a distributed file system and a parallel database, then upload the document set to be detected into the cloud computing environment;

Step 2: Preprocess the document set to be detected by removing stop words and performing word segmentation, and convert text files of different formats into text files of a consistent format;

Step 3: Transform each text from Step 2 into an n-dimensional word frequency vector, that is, extract the word frequency vector of the text, and then generate a vector fingerprint with the SimHash algorithm. The fingerprint length is 64 bits. Once obtained, the vector fingerprint is stored in a sequential file as a key-value pair, with the file name as the key and the 64-bit vector fingerprint as the value;

Step 4: Weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them. The file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between this vector and the document set.
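Steps 2 and 3 can be sketched as follows. This is only an illustration under stated assumptions: whitespace splitting and a tiny English stop-word list stand in for a real segmenter, BLAKE2b is an arbitrary choice of 64-bit token hash, term frequency is used as the feature weight, and a plain dictionary stands in for the key-value sequential file. None of the names below come from the patent.

```python
import hashlib
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}  # hypothetical stop-word list


def tokenize(text):
    # Whitespace splitting stands in for a real word segmenter.
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def token_hash64(token):
    # Assumed 64-bit hash of a single token.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash64(text):
    counts = Counter(tokenize(text))  # term frequency used as the weight
    totals = [0] * 64
    for token, weight in counts.items():
        h = token_hash64(token)
        for bit in range(64):
            # A 1 bit adds the weight, a 0 bit subtracts it.
            totals[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(64):
        if totals[bit] >= 0:  # negative sums give 0, everything else gives 1
            fingerprint |= 1 << bit
    return fingerprint


def build_index(files):
    # File name is the key, the 64-bit fingerprint is the value.
    return {name: simhash64(text) for name, text in files.items()}
```

Because the fingerprint depends only on which words occur and how often, two files with the same words in a different order receive the same value in this sketch.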

Preferably, a duplication threshold is defined in advance. When the similarity of two records is greater than or equal to the threshold, they are considered duplicate records. The similarity is calculated as follows:

where v_i denotes a record in the part shared by record A and record B, W(v_i) denotes the quantity of v_i, v_j denotes a record in the merger of all records that make up record A and record B, and W(v_j) denotes the quantity of v_j.
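One way to read these definitions is as a ratio: the total quantity of the shared part divided by the total quantity of the merged records. The sketch below implements that reading only as an assumption. The function names, the use of token lists as records, and the threshold value of 0.8 are all hypothetical.

```python
from collections import Counter


def record_similarity(record_a, record_b):
    # Assumed reading: sum of W(v_i) over the shared part, divided by
    # the sum of W(v_j) over the merger of both records.
    a, b = Counter(record_a), Counter(record_b)
    shared = sum((a & b).values())    # multiset intersection
    combined = sum((a | b).values())  # multiset union
    return shared / combined if combined else 0.0


def is_duplicate(record_a, record_b, threshold=0.8):
    # Records at or above the predefined threshold count as duplicates.
    return record_similarity(record_a, record_b) >= threshold
```

Under this reading, identical records score 1.0 and records with nothing in common score 0.0.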

The invention discloses the following technical effects: by preprocessing the document set to be detected with stop-word removal and word segmentation and converting it into n-dimensional word vectors, the application is suitable for text deduplication on massive data sets. The similarity of the file under test is judged by the angle between the feature vector of the text and the document set. The method offers high operating efficiency and short running time, and overcomes the inability of traditional near-duplicate detection techniques to handle massive text data sets.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of the massive-text near-duplicate detection of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in those embodiments. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.

The present invention provides a cloud-computing method for detecting similarity among massive document sets, comprising the following steps:

Step 3: Transform each text from Step 2 into an n-dimensional word frequency vector, that is, extract the word frequency vector of the text, and then generate a vector fingerprint with the SimHash algorithm. Fingerprint generation is an important prerequisite for similarity detection. Taking the generation of a fingerprint sequence as an example, if the binary number 0101 represents a four-bit hash feature signature, the four-dimensional vector produced by this feature is (-1, 1, -1, 1)^T. That is, where a bit of the hash signature is 0, the corresponding component of the mapped vector is -1; where a bit is 1, the corresponding component is 1. Within a document, all feature vectors are then weighted and added, with the feature weights serving as the weighting coefficients.
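The bit-to-component mapping in this step is small enough to state as code. The helper below is a sketch; its name is hypothetical, and it assumes the signature is given as a string of 0 and 1 characters.

```python
def signature_to_vector(bits):
    # Map each 0 bit to -1 and each 1 bit to +1.
    if set(bits) - {"0", "1"}:
        raise ValueError("signature must contain only 0 and 1")
    return [1 if b == "1" else -1 for b in bits]


print(signature_to_vector("0101"))  # [-1, 1, -1, 1]
```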

Suppose all documents are represented by five features d₁, …, d₅, whose corresponding 3-dimensional vectors are v(d₁) = (1, -1, 1)^T, v(d₂) = (-1, 1, 1)^T, v(d₃) = (1, -1, -1)^T, v(d₄) = (-1, -1, 1)^T, and v(d₅) = (1, 1, -1)^T, and a 3-bit signature is to be obtained for any document. By the working principle of SimHash, for a document D = (d₁ = 1, d₂ = 2, d₃ = 0, d₄ = 3, d₅ = 0)^T, the hash signature is computed as follows:

d₁*v(d₁) + d₂*v(d₂) + d₃*v(d₃) + d₄*v(d₄) + d₅*v(d₅) = (-4, -2, 6)^T. By the SimHash principle, if a component of the vector is less than 0, the corresponding bit of the signature is 0; otherwise it is 1. The final signature value is therefore m = 001. The fingerprint length is 64 bits. Once obtained, the vector fingerprint is stored in a sequential file as a key-value pair, with the file name as the key and the 64-bit vector fingerprint as the value.
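The weighted sum and the sign rule in this worked example can be checked with a short script. The function name and the list-based representation are assumptions made for illustration; the numbers are the ones given in the text.

```python
def weighted_signature(weights, feature_vectors):
    # Sum each feature vector scaled by its weight, component by component.
    total = [0] * len(feature_vectors[0])
    for weight, vector in zip(weights, feature_vectors):
        for i, component in enumerate(vector):
            total[i] += weight * component
    # A negative component gives bit 0; any other value gives bit 1.
    bits = "".join("0" if value < 0 else "1" for value in total)
    return total, bits


vectors = [(1, -1, 1), (-1, 1, 1), (1, -1, -1), (-1, -1, 1), (1, 1, -1)]
document = [1, 2, 0, 3, 0]  # d1..d5 from the example

total, signature = weighted_signature(document, vectors)
print(total, signature)  # [-4, -2, 6] 001
```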

Step 4: Weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them. The file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between this vector and the document set. A duplication threshold is defined in advance. When the similarity of two records is greater than or equal to the threshold, they are considered duplicate records. The similarity is calculated as follows:

Owing to environmental constraints, the present invention is illustrated by taking Simhash signature generation as an example:

Assume document A has the content: A US "Area 51" employee claims there are 9 flying saucers inside and once saw grey aliens. The procedure is as follows:

Step 1: Word segmentation. Segment the text of document A into feature words, producing a word sequence with noise words removed, and assign a weight to each word; the weights are assumed to fall into 5 levels. After segmentation the result is "US (4), Area 51 (5), employee (3), claims (1), inside (2), has (1), 9 craft (3), flying saucers (5), once (1), saw (3), grey (4), aliens (5)". The number in parentheses represents the importance of the word within the whole sentence; the larger the number, the more important the word.

Step 2: Hashing. Transform the text into an n-dimensional word frequency vector, that is, extract the word frequency vector of the text. Because fingerprint generation is an important prerequisite for similarity detection, a hash algorithm is used to generate the vector fingerprint: each word is turned into a hash value. For example, "US" hashes to 100101 and "Area 51" hashes to 101011. Each character string thus becomes a string of digits, which digitizes the text.

Step 3: Weighting. Using the hash results from the previous stage, all feature vectors in the document under test are weighted and added. A weighted digit string is formed according to the weight of each word, with the feature weight serving as the weighting coefficient. The file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between this vector and the document set. For example, the hash value of "US" is "100101", which weighting turns into "4 -4 -4 4 -4 4"; the hash value of "Area 51" is "101011", which weighting turns into "5 -5 5 -5 5 5".

Step 4: Merging. Add up the sequences computed for the individual words to obtain a single sequence. For example, adding "4 -4 -4 4 -4 4" for "US" and "5 -5 5 -5 5 5" for "Area 51" position by position gives "9 -9 1 -1 1 9". Only two words are used here as an example; a real calculation adds the sequences of all words.

Step 5: Dimensionality reduction. Convert the "9 -9 1 -1 1 9" computed in Step 4 into a string of 0s and 1s to form the final simhash signature: each position greater than 0 becomes 1, and each position less than 0 becomes 0. The final result is "101011".
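Steps 3 to 5 of this example can be reproduced with the sketch below. The two hash values and weights are the ones quoted in the text; the function names are hypothetical. Since the step does not say what happens to a sum of exactly 0, the sketch maps it to 1, following the "less than 0 gives 0, otherwise 1" rule stated earlier in the description.

```python
def weight_bits(hash_bits, weight):
    # Step 3: a 1 bit contributes +weight, a 0 bit contributes -weight.
    return [weight if b == "1" else -weight for b in hash_bits]


def merge(rows):
    # Step 4: add the weighted sequences position by position.
    return [sum(column) for column in zip(*rows)]


def reduce_to_signature(merged):
    # Step 5: negative sums become 0, everything else becomes 1.
    return "".join("0" if value < 0 else "1" for value in merged)


words = {"US": ("100101", 4), "Area 51": ("101011", 5)}
rows = [weight_bits(bits, weight) for bits, weight in words.values()]
merged = merge(rows)
signature = reduce_to_signature(merged)

print(rows)       # [[4, -4, -4, 4, -4, 4], [5, -5, 5, -5, 5, 5]]
print(merged)     # [9, -9, 1, -1, 1, 9]
print(signature)  # 101011
```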

Step 6: Compare the simhash signature values generated in the previous stage. The similarity between two texts can be measured by counting the positions at which the 0s and 1s of their two simhash signatures differ.
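Counting the differing positions of two signatures is a Hamming distance, which can be computed with an XOR. The sketch below assumes signatures are held as integers; the function names and the cutoff of 3 differing bits are hypothetical choices, not values taken from the text.

```python
def hamming_distance(sig_a, sig_b):
    # XOR leaves a 1 wherever the two signatures differ; count those bits.
    return bin(sig_a ^ sig_b).count("1")


def is_near_duplicate(sig_a, sig_b, max_distance=3):
    # Assumed decision rule: few differing bits means similar texts.
    return hamming_distance(sig_a, sig_b) <= max_distance


print(hamming_distance(0b101011, 0b100101))  # 3
```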

By preprocessing the document set to be detected with stop-word removal and word segmentation and converting it into n-dimensional word vectors, the application is suitable for text deduplication on massive data sets. The similarity of the file under test is judged by the angle between the feature vector of the text and the document set. The method offers high operating efficiency and short running time, and overcomes the inability of traditional near-duplicate detection techniques to handle massive text data sets.

In the description of the present invention, it should be understood that orientation or positional terms such as "longitudinal", "transverse", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only for convenience in describing the present invention and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they are therefore not to be construed as limiting the present invention.

The embodiments described above only describe preferred modes of the present invention and do not limit its scope. Without departing from the design spirit of the present invention, all variations and improvements made to the technical solution of the present invention by those of ordinary skill in the art shall fall within the protection scope determined by the claims of the present invention.

Claims

1. A cloud-computing method for detecting similarity among massive document sets, characterized by comprising the following steps:

Step 1: Build a cloud computing environment from a distributed file system and a parallel database, then upload the document set to be detected into the cloud computing environment;

Step 2: Preprocess the document set to be detected by removing stop words and performing word segmentation, and convert text files of different formats into text files of a consistent format;

Step 3: Transform each text from Step 2 into an n-dimensional word frequency vector, that is, extract the word frequency vector of the text, and then generate a vector fingerprint with the SimHash algorithm. The fingerprint length is 64 bits. Once obtained, the vector fingerprint is stored in a sequential file as a key-value pair, with the file name as the key and the 64-bit vector fingerprint as the value;

Step 4: Weight all feature vectors in the document under test, using the feature weights as weighting coefficients, and sum them. The file to be detected is then represented by the weighted sum vector, and its similarity is judged by the angle between this vector and the document set.

2. The cloud-computing method for detecting similarity among massive document sets according to claim 1, characterized in that a duplication threshold is defined in advance; when the similarity of two records is greater than or equal to the threshold, they are considered duplicate records, and the similarity is calculated as follows:

where v_i denotes a record in the part shared by record A and record B, W(v_i) denotes the quantity of v_i, v_j denotes a record in the merger of all records that make up record A and record B, and W(v_j) denotes the quantity of v_j.