CN102682104A - Method for searching similar texts and link bit similarity measuring algorithm - Google Patents

Publication number
CN102682104A
Authority
CN
China
Prior art keywords
fingerprint
similarity
text
document
connect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012101353393A
Other languages
Chinese (zh)
Inventor
龙军
袁鑫攀
罗跃逸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN2012101353393A priority Critical patent/CN102682104A/en
Publication of CN102682104A publication Critical patent/CN102682104A/en
Pending legal-status Critical Current

Landscapes

  • Collating Specific Patterns (AREA)

Abstract

The invention discloses a method for searching similar texts, comprising the following steps: step 1, text feature extraction, which extracts the text feature set $S_{shgs}$; step 2, link bit fingerprint generation, which generates a link bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$; step 3, link bit similarity measurement, which compares the link bit fingerprints of two texts; and step 4, retrieval of the required text according to the link bit similarity result. The invention also discloses a corresponding link bit similarity measurement algorithm. Experimental data show that, at the cost of a slight loss of precision, the algorithm reduces the number of comparisons by a multiple (by the number of fingerprint values linked together) and thereby improves performance.

Description

A method for searching similar texts and a link bit similarity measurement algorithm
Technical field
The present invention relates to the field of information retrieval, and in particular to a method of estimating similarity. It can be applied to similarity measurement between massive numbers of documents and is especially suited to quickly finding similar text in massive amounts of information.
Technical background
The rapid development of Internet technology has made the data on the network grow exponentially, and quickly finding useful information in this mass of data has become increasingly important. The concept of text similarity measurement and its related techniques arose in response. A good text similarity measure is of great significance in research fields such as automatic question answering, intelligent retrieval, web page deduplication, and natural language processing.
Text similarity is a metric of the degree of matching between two or more texts: the higher the similarity, the more alike the texts are, and vice versa. The traditional text similarity measure is the vector space model (VSM), which obtains the similarity of two documents by computing the inner product of the weighted term-frequency vectors of the query document and a document in the data set. This algorithm must store a large number of feature words, compares slowly, and has low accuracy, so it cannot be applied to similarity measurement over massive data. The minwise similarity measurement algorithm converts the similarity problem into the problem of the probability of an event: it maps the set of text feature words to a set of hash values, turning string comparison into numeric comparison. It is suitable for similarity measurement over massive data, but it must compare a large number of fingerprints and takes up a large amount of storage. In 2010, Ping Li et al. improved on the minwise similarity measurement algorithm and proposed the b-bit minwise similarity measurement algorithm, which estimates the similarity of two documents using fewer bits (b bits per fingerprint value), but the algorithm still has to compare a large number of fingerprints.
Summary of the invention
The present invention proposes a new method for searching similar texts that overcomes the above deficiencies of the prior art.
The method according to the invention comprises the following steps:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$.
Step 2, link bit fingerprint generation: this step generates a link bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$.
Step 3, link bit similarity measurement: this step compares the link bit fingerprint similarity of two documents.
Step 4, the required text is obtained using the resulting link bit fingerprint similarity.
The present invention also provides a link bit similarity measurement algorithm, characterized in that it comprises steps 1, 2, and 3 above.
Description of drawings
Fig. 1 is the main flow chart of the method according to the invention.
Fig. 2 shows the relation between link bit similarity and variance according to an embodiment of the invention.
Fig. 3 shows the experimental precision and recall results of the link bit estimator on the XX data set according to an embodiment of the invention.
Fig. 4 compares the actual efficiency on the XX data set according to an embodiment of the invention.
Embodiment
The method provided by the invention is described in detail below with reference to the accompanying drawings, and its advantages are illustrated with examples and experimental data. Experiments show that, at the cost of a very small loss of precision, the method reduces the number of comparisons by a multiple and improves search performance.
The method for searching similar texts proposed by the invention comprises the following steps:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$.
Step 2, link bit fingerprint generation: this step generates a link bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$.
Step 3, link bit similarity measurement: this step compares the link bit fingerprint similarity of two documents.
Step 4, the required text is obtained using the resulting link bit fingerprint similarity.
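Step 4 is stated only in general terms. One plausible reading, sketched here under assumptions, is a threshold search: every stored fingerprint is scored against the query fingerprint and the documents whose estimated similarity reaches a threshold are returned. The names `search_similar`, `match_fraction`, and the `threshold` parameter are hypothetical and do not come from the patent; `match_fraction` is only a stand-in scoring function.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Fingerprint = Sequence[int]


def match_fraction(fp1: Fingerprint, fp2: Fingerprint) -> float:
    """Stand-in score: fraction of positions where two fingerprints agree."""
    return sum(a == b for a, b in zip(fp1, fp2)) / len(fp1)


def search_similar(
    query_fp: Fingerprint,
    corpus_fps: Dict[str, Fingerprint],
    estimate: Callable[[Fingerprint, Fingerprint], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs with score >= threshold, best first."""
    hits = []
    for doc_id, fp in corpus_fps.items():
        score = estimate(query_fp, fp)
        if score >= threshold:
            hits.append((doc_id, score))
    # Sort by descending score so the most similar document comes first.
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits
```

Any of the similarity estimators described in step 3 could be passed as `estimate`; the threshold plays the role of a minimum similarity.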
Preferably, step 1 specifically comprises:
First, the text is scanned and analyzed, and a Chinese word segmentation algorithm is used to segment the document into a set of words. Then a stop-word list is built and used to filter the noise out of the text; the word set that remains after filtering is the document's feature set $S_{shgs}$. Noise means words that carry no meaning in the text, typically high-frequency, low-content words such as auxiliary words and function words.
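The two operations of step 1 (segmentation, then stop-word filtering) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent does not name a segmentation algorithm, so the segmenter is a pluggable parameter, and the whitespace split used as its default is an assumption that would have to be replaced by a real Chinese word segmenter. The function name `extract_features` is hypothetical.

```python
from typing import Callable, Iterable, List, Set


def extract_features(
    text: str,
    stop_words: Iterable[str],
    segment: Callable[[str], List[str]] = str.split,
) -> Set[str]:
    """Segment the text into words and drop stop words.

    `segment` defaults to whitespace splitting, which is only a stand-in
    for the Chinese word segmentation algorithm assumed by step 1.
    """
    stop = set(stop_words)
    return {word for word in segment(text) if word and word not in stop}
```

For instance, `extract_features("the cat and the dog", {"the", "and"})` yields the feature set `{"cat", "dog"}`.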
Preferably, step 2 specifically comprises:
1) Forming the minwise fingerprint
The Rabin function is applied to the document feature set $S_{shgs}$ produced in step 1 to map each feature to a 32-bit integer; the resulting set is named $S_d$. Let the universe be $\Omega = \{0, 1, \ldots, D-1\}$ and let $a_0 a_1 \ldots a_{D-1}$ be an arrangement of the elements of $\Omega$. The vector $(a_0, a_1, \ldots, a_{D-1})$ then represents a permutation of $\Omega$:
$$\pi = \begin{pmatrix} 0 & 1 & \cdots & D-1 \\ a_0 & a_1 & \cdots & a_{D-1} \end{pmatrix}$$
If, for a data set $X \subseteq \Omega$ and any $x \in X$, the permutation $\pi$ satisfies
$$\Pr\left(\min\{\pi(X)\} = \pi(x)\right) = \frac{1}{|X|}$$
then $\pi$ is a minwise random permutation. In other words, under the permutation $\pi$ every element $x$ of the data set $X$ has the same probability of becoming the minimum after permutation. Thus, with $k$ independent random permutations $\pi_1, \pi_2, \ldots, \pi_k$, the set $S_d$ is converted into the minwise fingerprint:
$$\bar{S}_d = \left(\min\{\pi_1(S_d)\}, \min\{\pi_2(S_d)\}, \ldots, \min\{\pi_k(S_d)\}\right).$$
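The sketch below illustrates how a minwise fingerprint might be computed when the universe is too large to store explicit permutations. It rests on two assumptions that are not in the patent: CRC-32 (`zlib.crc32`) stands in for the Rabin function, and each permutation is approximated by a random linear map $x \mapsto (a x + b) \bmod p$ with a prime $p$, which is a bijection on $\{0, \ldots, p-1\}$ but only approximately minwise. All function names are hypothetical.

```python
import random
import zlib
from typing import Iterable, List, Set, Tuple

# Prime modulus larger than any 32-bit value (assumption of this sketch).
MERSENNE_PRIME = (1 << 61) - 1


def to_int32_set(features: Iterable[str]) -> Set[int]:
    """Map each feature word to a 32-bit integer (CRC-32 stands in for Rabin)."""
    return {zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF for word in features}


def make_hash_family(k: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw k pairs (a, b); each defines the map x -> (a*x + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(k)
    ]


def minwise_fingerprint(s_d: Set[int], family: List[Tuple[int, int]]) -> List[int]:
    """Keep the minimum image of the (non-empty) set under each map."""
    return [min((a * x + b) % MERSENNE_PRIME for x in s_d) for a, b in family]
```

Both documents must be fingerprinted with the same family, since the estimators compare the fingerprints position by position.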
2) Forming the b-bit minwise fingerprint
Define the function $B(x, b) = x \,\&\, (2^b - 1)$, where $\&$ is bitwise AND. $B(x, b)$ takes the lowest $b$ bits of $x$; $b$ is the number of bits to keep.
Taking the lowest $b$ bits of each element of $\bar{S}_d$ forms the b-bit minwise fingerprint:
$$B(\bar{S}_d, b) = \left(B(\min\{\pi_1(S_d)\}, b), B(\min\{\pi_2(S_d)\}, b), \ldots, B(\min\{\pi_k(S_d)\}, b)\right).$$
3) Forming the link bit fingerprint
Linking every $n$ consecutive b-bit values of $B(\bar{S}_d, b)$ gives the link bit fingerprint $S_{dn}$.
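Steps 2) and 3) amount to a bit mask followed by concatenation of neighbouring values. The following is a minimal sketch; the function names are hypothetical, and packing each linked group into a single integer is a representation chosen for this illustration rather than something the patent prescribes.

```python
from typing import List, Sequence


def lowest_bits(x: int, b: int) -> int:
    """B(x, b): keep the lowest b bits of x."""
    return x & ((1 << b) - 1)


def b_bit_fingerprint(minwise_fp: Sequence[int], b: int) -> List[int]:
    """Apply B(., b) to every value of a minwise fingerprint."""
    return [lowest_bits(value, b) for value in minwise_fp]


def link_fingerprint(b_bit_fp: Sequence[int], b: int, n: int) -> List[int]:
    """Concatenate every n consecutive b-bit values into one n*b-bit value."""
    if len(b_bit_fp) % n != 0:
        raise ValueError("fingerprint length k must be a multiple of n")
    linked = []
    for start in range(0, len(b_bit_fp), n):
        value = 0
        for part in b_bit_fp[start:start + n]:
            value = (value << b) | part  # append the next b bits
        linked.append(value)
    return linked
```

For a minwise fingerprint (0, 2, 1, 1, 4, 1) with b = 1 and n = 2, the 1-bit fingerprint is (0, 0, 1, 1, 0, 1) and the linked values are 00, 11, 01 in binary.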
Example 1 below illustrates the process of step 2. The examples in this application serve only as illustration and do not limit the invention.
Example 1, fingerprint formation: let the universe be $\Omega = \{0, 1, 2, 3, 4, 5, 6, 7\}$, $S_1 = \{1, 2, 4\}$, $S_2 = \{1, 4, 3, 6\}$, take $k = 6$, and let the random permutations $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ be:
$$\pi_1 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 2 & 3 & 0 & 4 & 6 & 7 & 1 & 5 \end{pmatrix}, \quad \pi_2 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 1 & 6 & 5 & 7 & 2 & 0 & 4 & 3 \end{pmatrix}$$
$$\pi_3 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 5 & 1 & 7 & 2 & 6 & 3 & 4 & 0 \end{pmatrix}, \quad \pi_4 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 7 & 1 & 5 & 4 & 3 & 2 & 6 & 0 \end{pmatrix}$$
$$\pi_5 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 3 & 7 & 6 & 0 & 4 & 5 & 1 & 2 \end{pmatrix}, \quad \pi_6 = \begin{pmatrix} 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\ 4 & 1 & 5 & 0 & 3 & 6 & 7 & 2 \end{pmatrix}$$
1) Forming the minwise fingerprint
Mapping $S_1$ through $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ gives:
$\pi_1(S_1) = \{3, 0, 6\}$, $\pi_2(S_1) = \{6, 5, 2\}$, $\pi_3(S_1) = \{1, 7, 6\}$, $\pi_4(S_1) = \{1, 5, 3\}$, $\pi_5(S_1) = \{7, 6, 4\}$, $\pi_6(S_1) = \{1, 5, 3\}$;
The minwise fingerprint of document 1 is:
$$\bar{S}_1 = \left(\min\{\pi_1(S_1)\}, \min\{\pi_2(S_1)\}, \ldots, \min\{\pi_6(S_1)\}\right) = (0, 2, 1, 1, 4, 1)$$
Mapping $S_2$ through $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ gives:
$\pi_1(S_2) = \{3, 6, 4, 1\}$, $\pi_2(S_2) = \{6, 2, 7, 4\}$, $\pi_3(S_2) = \{1, 6, 2, 4\}$, $\pi_4(S_2) = \{1, 3, 4, 6\}$, $\pi_5(S_2) = \{7, 4, 0, 1\}$, $\pi_6(S_2) = \{1, 3, 0, 7\}$;
The minwise fingerprint of document 2 is:
$$\bar{S}_2 = \left(\min\{\pi_1(S_2)\}, \min\{\pi_2(S_2)\}, \ldots, \min\{\pi_6(S_2)\}\right) = (1, 2, 1, 1, 0, 0)$$
Therefore, the minwise fingerprints generated from $S_1$ and $S_2$ under the random permutations $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ are $\bar{S}_1$ and $\bar{S}_2$ respectively.
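The minwise fingerprints of Example 1 can be checked mechanically. In the sketch below each permutation is stored as a list whose index is the original element and whose value is its image; the variable and function names are illustrative only.

```python
# Permutations of Example 1: PERMUTATIONS[j][x] is the image of x under pi_(j+1).
PERMUTATIONS = [
    [2, 3, 0, 4, 6, 7, 1, 5],  # pi_1
    [1, 6, 5, 7, 2, 0, 4, 3],  # pi_2
    [5, 1, 7, 2, 6, 3, 4, 0],  # pi_3
    [7, 1, 5, 4, 3, 2, 6, 0],  # pi_4
    [3, 7, 6, 0, 4, 5, 1, 2],  # pi_5
    [4, 1, 5, 0, 3, 6, 7, 2],  # pi_6
]
S1 = {1, 2, 4}
S2 = {1, 3, 4, 6}


def minwise_fingerprint(feature_set, permutations):
    """For each permutation, keep the smallest image of the set's elements."""
    return [min(pi[x] for x in feature_set) for pi in permutations]


fp1 = minwise_fingerprint(S1, PERMUTATIONS)
fp2 = minwise_fingerprint(S2, PERMUTATIONS)
print(fp1)  # [0, 2, 1, 1, 4, 1]
print(fp2)  # [1, 2, 1, 1, 0, 0]
```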
2) Forming the b-bit minwise fingerprint
Taking the lowest $b = 1$ bit of each value of $\bar{S}_1$ gives the b-bit minwise fingerprint:
$$B(\bar{S}_1, b) = \left(B(\min\{\pi_1(S_1)\}, b), B(\min\{\pi_2(S_1)\}, b), \ldots, B(\min\{\pi_6(S_1)\}, b)\right) = (0, 0, 1, 1, 0, 1)$$
Taking the lowest $b = 1$ bit of each value of $\bar{S}_2$ gives the b-bit minwise fingerprint:
$$B(\bar{S}_2, b) = \left(B(\min\{\pi_1(S_2)\}, b), B(\min\{\pi_2(S_2)\}, b), \ldots, B(\min\{\pi_6(S_2)\}, b)\right) = (1, 0, 1, 1, 0, 0)$$
3) Forming the link bit fingerprint
Linking every $n = 2$ b-bit values of $B(\bar{S}_1, b)$: $S_{1n} = \{0\text{-}0, 1\text{-}1, 0\text{-}1\} = \{00, 11, 01\}$
Linking every $n = 2$ b-bit values of $B(\bar{S}_2, b)$: $S_{2n} = \{1\text{-}0, 1\text{-}1, 0\text{-}0\} = \{10, 11, 00\}$
Preferably, step 3 specifically comprises:
1) Minwise similarity estimation
In the minwise similarity measurement algorithm, the unbiased estimator of the similarity $R$ of two documents is:
$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}.$$
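The estimator $\hat{R}_M$ is simply the fraction of fingerprint positions at which the two minwise fingerprints agree. A minimal sketch (the function name is hypothetical):

```python
from typing import Sequence


def minwise_estimate(fp1: Sequence[int], fp2: Sequence[int]) -> float:
    """R_M estimate: share of the k positions where the two minima are equal."""
    if len(fp1) != len(fp2) or len(fp1) == 0:
        raise ValueError("fingerprints must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(fp1, fp2))
    return matches / len(fp1)
```

Applied to the Example 1 fingerprints (0, 2, 1, 1, 4, 1) and (1, 2, 1, 1, 0, 0), three of the six positions agree, so the estimate is 0.5.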
2) b-bit minwise similarity estimation
Define $z_1$, $z_2$ as the minimum values obtained when a random permutation $\pi$ acts on the sets $S_1$ and $S_2$:
$$z_1 = \min\{\pi(S_1)\}, \quad z_2 = \min\{\pi(S_2)\}$$
Let $e_{1,i}$ be the $i$-th lowest bit of $z_1$ and $e_{2,i}$ the $i$-th lowest bit of $z_2$. In b-bit minwise similarity estimation, the unbiased estimator of the similarity of two documents is:
$$\hat{R}_b = \frac{\hat{E}_b - C_{1,b}}{1 - C_{2,b}}$$
where
$$\hat{E}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} = 1 \right\}$$
$$C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}$$
$$C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2}$$
$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}$$
$$A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}$$
$$r_1 = \frac{f_1}{D}, \quad r_2 = \frac{f_2}{D}, \quad f_1 = |S_1|, \quad f_2 = |S_2|$$
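The formulas above translate directly into code. The sketch below is one straightforward transcription, with hypothetical function names; comparing two b-bit values for equality is equivalent to requiring that all b bit indicators in $\hat{E}_b$ equal 1.

```python
from typing import Sequence, Tuple


def a_term(r: float, b: int) -> float:
    """A_{.,b} = r (1 - r)^(2^b - 1) / (1 - (1 - r)^(2^b))."""
    p = 2 ** b
    return r * (1 - r) ** (p - 1) / (1 - (1 - r) ** p)


def correction_terms(f1: int, f2: int, D: int, b: int) -> Tuple[float, float]:
    """Return (C_{1,b}, C_{2,b}) for set sizes f1, f2 and universe size D."""
    r1, r2 = f1 / D, f2 / D
    a1, a2 = a_term(r1, b), a_term(r2, b)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def b_bit_estimate(
    bfp1: Sequence[int], bfp2: Sequence[int], f1: int, f2: int, D: int, b: int
) -> float:
    """R_b estimate from two b-bit minwise fingerprints of equal length k."""
    e_hat = sum(x == y for x, y in zip(bfp1, bfp2)) / len(bfp1)
    c1, c2 = correction_terms(f1, f2, D, b)
    return (e_hat - c1) / (1 - c2)
```

With $f_1 = 3$, $f_2 = 4$, $D = 8$, $b = 1$ and the 1-bit fingerprints (0, 0, 1, 1, 0, 1) and (1, 0, 1, 1, 0, 0), the formulas evaluate to $C_{1,b} = 33/91 \approx 0.3626$, $C_{2,b} = 97/273 \approx 0.3553$, and $\hat{R}_b = 83/176 \approx 0.4716$.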
3) Link bit minwise similarity estimation
Let $e_{1,i,\pi_j}$ ($e_{2,i,\pi_j}$) denote the $i$-th lowest bit of $z_1$ ($z_2$) under the permutation $\pi_j$. Define the link bit variables $x_1$, $x_2$ obtained by linking $n$ b-bit values:
$$x_1 = e_{1,1,\pi_1} e_{1,2,\pi_1} \cdots e_{1,b,\pi_1} \; e_{1,1,\pi_2} e_{1,2,\pi_2} \cdots e_{1,b,\pi_2} \; \cdots \; e_{1,1,\pi_n} e_{1,2,\pi_n} \cdots e_{1,b,\pi_n},$$
$$x_2 = e_{2,1,\pi_1} e_{2,2,\pi_1} \cdots e_{2,b,\pi_1} \; e_{2,1,\pi_2} e_{2,2,\pi_2} \cdots e_{2,b,\pi_2} \; \cdots \; e_{2,1,\pi_n} e_{2,2,\pi_n} \cdots e_{2,b,\pi_n}$$
$x_1 = x_2$ holds only when $e_{1,i,\pi_j} = e_{2,i,\pi_j}$ for all $i \in [1, b]$, $j \in [1, n]$.
Let $G_{b,n}$ denote the probability that $x_1 = x_2$, where $b$ is the number of bits and $n$ is the number of linked values. Then:
$$G_{b,n} = E_b^{\,n}$$
The similarity of document 1 and document 2 is then estimated as:
$$\hat{R}_{b,n} = \frac{\hat{G}_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}}$$
where $\hat{G}_{b,n}$ is the observed fraction of linked groups for which $x_1 = x_2$, and $C_{1,b}$ and $C_{2,b}$ are defined as in the b-bit estimator.
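A sketch of the link bit estimator follows. The formula image defining $\hat{G}_{b,n}$ is missing from the source, so the code assumes that $\hat{G}_{b,n}$ is the fraction of the $k/n$ linked groups in which the two documents hold identical values; this assumption reproduces the value 0.3330 reported in Example 2. Function names are hypothetical.

```python
from typing import Sequence, Tuple


def correction_terms(f1: int, f2: int, D: int, b: int) -> Tuple[float, float]:
    """C_{1,b} and C_{2,b}, as defined for the b-bit estimator."""
    def a_term(r: float) -> float:
        p = 2 ** b
        return r * (1 - r) ** (p - 1) / (1 - (1 - r) ** p)

    r1, r2 = f1 / D, f2 / D
    a1, a2 = a_term(r1), a_term(r2)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    return c1, c2


def link_bit_estimate(
    linked1: Sequence[int],
    linked2: Sequence[int],
    f1: int,
    f2: int,
    D: int,
    b: int,
    n: int,
) -> float:
    """R_{b,n} estimate from two link bit fingerprints (k/n values each)."""
    # Assumed estimator of G_{b,n}: share of linked groups that match exactly.
    g_hat = sum(x == y for x, y in zip(linked1, linked2)) / len(linked1)
    c1, c2 = correction_terms(f1, f2, D, b)
    # G = E^n, so the n-th root of g_hat estimates E_b.
    return (g_hat ** (1.0 / n) - c1) / (1 - c2)
```

Only $k/n$ equality tests are made per document pair, which is where the saving over the b-bit estimator comes from. For the link bit fingerprints {00, 11, 01} and {10, 11, 00} with $f_1 = 3$, $f_2 = 4$, $D = 8$, $b = 1$, $n = 2$, one group in three matches and the estimate is about 0.333.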
Example 2 below illustrates the implementation of step 3.
Example 2, similarity estimation:
1) Minwise similarity estimation
The minwise similarity estimate of $S_1$ and $S_2$ is
$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\} = \frac{3}{6} = 0.5$$
2) b-bit minwise similarity estimation
Take $b = 1$. Then $f_1 = 3$, $f_2 = 4$, $r_1 = 3/8 = 0.375$, $r_2 = 4/8 = 0.5$, $A_{1,b} = 0.385$, $A_{2,b} = 0.333$, $C_{1,b} = 0.363$, $C_{2,b} = 0.355$,
$$\hat{E}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} = 1 \right\} = \frac{4}{6} = 0.667,$$
and therefore
$$\hat{R}_b = \frac{\hat{E}_b - C_{1,b}}{1 - C_{2,b}} \approx 0.472.$$
3) Link bit fingerprint similarity estimation
With $b = 1$ and $n = 2$, one of the three linked groups matches, so $\hat{G}_{b,n} = 1/3 = 0.333$ and
$$\hat{R}_{b,n} = \frac{\hat{G}_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}} = 0.3330.$$
4) Jaccard similarity
$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{2}{5} = 0.4$$
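The exact Jaccard similarity that the estimators approximate can be computed directly from the two feature sets. A minimal sketch (the function name is hypothetical):

```python
from typing import Set


def jaccard(s1: Set[int], s2: Set[int]) -> float:
    """Exact Jaccard similarity |S1 & S2| / |S1 | S2|."""
    union = s1 | s2
    if not union:
        raise ValueError("Jaccard similarity is undefined for two empty sets")
    return len(s1 & s2) / len(union)
```

For $S_1 = \{1, 2, 4\}$ and $S_2 = \{1, 3, 4, 6\}$ the intersection is {1, 4} and the union is {1, 2, 3, 4, 6}, so the exact similarity is 2/5 = 0.4.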
The estimates differ from the actual value because $k$ is too small. As the variance curves in Fig. 2 show, the variance is large when $k$ is very small. As $k$ grows, the estimate $\hat{R}_{b,n}$ likewise comes closer and closer to the actual value $R$, and the estimate becomes more accurate.
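The effect of $k$ can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the patent's experiment: it draws $k$ uniformly random permutations of a small universe with `random.Random.shuffle` and applies the plain minwise estimator, whose expected value is the Jaccard similarity. The function name and the `seed` parameter are hypothetical.

```python
import random
from typing import Set


def simulated_minwise_estimate(
    s1: Set[int], s2: Set[int], universe_size: int, k: int, seed: int = 0
) -> float:
    """Minwise estimate of similarity using k fresh random permutations."""
    rng = random.Random(seed)
    perm = list(range(universe_size))
    matches = 0
    for _ in range(k):
        rng.shuffle(perm)  # perm[x] is the image of x under a new permutation
        if min(perm[x] for x in s1) == min(perm[x] for x in s2):
            matches += 1
    return matches / k
```

With $k = 6$ the estimate can only take the values 0, 1/6, ..., 1, so it cannot hit 0.4 exactly; with $k$ in the tens of thousands it concentrates tightly around the true value.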
Compared with the prior art, the invention has the following advantage: relative to the existing b-bit minwise similarity measurement algorithm, it reduces the number of comparisons by a multiple and so obtains a corresponding multiple improvement in performance. This advantage is demonstrated below from three aspects:
1) Variance analysis
By giving up a very small amount of precision, the invention obtains a multiple improvement in performance, which is of great practical significance. Fig. 2 shows, for $k = 1000$ and four selected values of $r_1 = r_2$ (ranging from $10^{-10}$ to 0.9), the relation between similarity ($R$) and variance (Var) for $b = 1$, $b = 2$, and, with $n = 2$, for $R_{1,2}$ and $R_{2,2}$. The variance of the link bit estimator $R_{2,2}$ is larger than that of the $b = 2$ estimator, so precision drops somewhat; but because two 2-bit values are linked, the number of comparisons needed is halved. In similarity detection over massive data, for example web page deduplication, similarity often has to be estimated for hundreds of millions of web page pairs, so trading a very small loss of precision for a multiple improvement in performance is of great practical significance.
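The trade-off can be made concrete with approximate variance formulas. These formulas are not given in the patent text; the sketch assumes independent permutations, uses $E_b = C_{1,b} + (1 - C_{2,b}) R$ (the estimator formula solved for $E_b$), treats $\hat{E}_b$ and $\hat{G}_{b,n}$ as binomial proportions over $k$ and $k/n$ trials, and applies a first-order (delta method) approximation to the $n$-th root. All function names are hypothetical.

```python
from typing import Tuple


def correction_terms(r1: float, r2: float, b: int) -> Tuple[float, float]:
    """C_{1,b} and C_{2,b} from the relative set sizes r1, r2."""
    def a_term(r: float) -> float:
        p = 2 ** b
        return r * (1 - r) ** (p - 1) / (1 - (1 - r) ** p)

    a1, a2 = a_term(r1), a_term(r2)
    return (a1 * r2 + a2 * r1) / (r1 + r2), (a1 * r1 + a2 * r2) / (r1 + r2)


def var_b_bit(R: float, r1: float, r2: float, b: int, k: int) -> float:
    """Approximate variance of the b-bit estimator with k samples."""
    c1, c2 = correction_terms(r1, r2, b)
    e = c1 + (1 - c2) * R
    return e * (1 - e) / (k * (1 - c2) ** 2)


def var_link_bit(R: float, r1: float, r2: float, b: int, n: int, k: int) -> float:
    """Delta-method variance of the link bit estimator (k/n linked groups)."""
    c1, c2 = correction_terms(r1, r2, b)
    e = c1 + (1 - c2) * R
    g = e ** n
    var_g = g * (1 - g) / (k / n)            # binomial proportion over k/n groups
    slope = (1.0 / n) * g ** (1.0 / n - 1.0)  # derivative of g^(1/n)
    return slope ** 2 * var_g / (1 - c2) ** 2
```

Under these assumptions the ratio of the two variances is $(1 + E_b + \cdots + E_b^{n-1}) / (n E_b^{n-1}) \ge 1$, which agrees with the qualitative statement that linking raises the variance while dividing the number of comparisons by $n$.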
2) Precision and recall analysis
Fig. 3 shows the experimental precision and recall of the link bit similarity measurement algorithm for similarity $R \ge R_0$. The recall curves in Fig. 3 are almost indistinguishable, whereas the precision curves differ to some degree, so the precision results are analyzed from the following two angles.
First, at $R_0 = 0.5$ and a precision of 0.8, the five estimators compared require $k = 100, 500, 700, 300, 450$ samples respectively. Take a link bit estimator as an example: to reach the same precision it needs 700 samples, more than the 500 needed by the b-bit estimator. However, because the estimate is made on 2 linked values, the link bit estimator needs only 700/2 = 350 comparisons, while the b-bit estimator needs 500. It must be acknowledged that the b-bit estimator needs 200 fewer samples, so it uses less storage than the link bit estimator.
Second, at $R_0 = 0.5$ and $k = 600$, the precision of the five estimators is 0.9, 0.88, 0.84, 0.86, and 0.79 respectively. Taking the same example, with an identical sample count $k = 600$ the b-bit estimator has a precision of 0.88 and the link bit estimator reaches 0.86, so the link bit estimator is slightly less precise than the b-bit estimator, but the gap is very small. The link bit estimator needs only 600/2 = 300 comparisons, while the b-bit estimator needs 600. Because the sample count $k = 600$ is identical, the storage used is the same.
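The comparison counts and storage figures quoted above follow from simple arithmetic: a b-bit fingerprint of $k$ samples needs $k$ comparisons, a link bit fingerprint with $n$ linked values needs $k/n$, and both store $k \cdot b$ bits. A small sketch with hypothetical helper names:

```python
def comparisons(k: int, n: int = 1) -> int:
    """Comparisons per document pair: k for b-bit (n = 1), k/n when linking n."""
    if k % n != 0:
        raise ValueError("k must be a multiple of n")
    return k // n


def storage_bits(k: int, b: int) -> int:
    """Bits stored per document: k samples of b bits each, linked or not."""
    return k * b
```

For the two cases in the text, `comparisons(700, 2)` gives 350 against `comparisons(500)` = 500, and `comparisons(600, 2)` gives 300 against `comparisons(600)` = 600.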
The precision and recall analysis leads to the following conclusion: when $k$ is large, the link bit minwise similarity measurement algorithm of the invention and the b-bit minwise similarity measurement algorithm have very close precision, and using the link bit algorithm to estimate similarity reduces the number of comparisons and improves efficiency. When $k$ is small, the link bit algorithm and the b-bit minwise algorithm each have their own strengths in efficiency and storage space, and the choice can be made according to system requirements.
3) Efficiency analysis
10000 document pairs were selected at random and the CPU time was measured, with $k = 600$ in the experiment; the results are shown in Fig. 4. Fig. 4 shows that the link bit estimator with $b = 1$ needs the least running time, because it performs only $k/2 = 300$ comparisons where the b-bit estimator with $b = 1$ performs $k = 600$. Likewise, the link bit estimator with $b = 2$ performs $k/2$ comparisons where the b-bit estimator with $b = 2$ performs $k$. The experimental results show that the link bit similarity measurement algorithm needs less CPU time, close to half that of the b-bit minwise similarity measurement algorithm. The algorithm described in the invention therefore improves on the performance of the b-bit minwise similarity measurement algorithm by a multiple.

Claims (5)

1. A method for searching similar texts, characterized in that it comprises the following steps:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$;
Step 2, link bit fingerprint generation: this step generates a link bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link bit similarity measurement: this step compares the link bit fingerprint similarity of two documents;
Step 4, the required text is obtained using the resulting link bit fingerprint similarity.
2. The method for searching similar texts according to claim 1, characterized in that step 1 specifically comprises:
First, the text is scanned and analyzed, and a Chinese word segmentation algorithm is used to segment the document into a set of words; then a stop-word list is built and used to filter the noise out of the text, the word set remaining after filtering being the document's feature set $S_{shgs}$.
3. The method for searching similar texts according to claim 1 or 2, characterized in that step 2 specifically comprises:
First, forming the minwise fingerprint; then forming the b-bit minwise fingerprint; and finally forming the link bit fingerprint.
4. The link bit similarity measurement algorithm according to any one of claims 1 to 3, characterized in that step 3 specifically comprises:
Define $z_1$, $z_2$ as the minimum values obtained when a random permutation $\pi$ acts on the sets $S_1$ and $S_2$ of document 1 and document 2:
$$z_1 = \min\{\pi(S_1)\}, \quad z_2 = \min\{\pi(S_2)\},$$
Let $e_{1,i,\pi_j}$ ($e_{2,i,\pi_j}$) denote the $i$-th lowest bit of $z_1$ ($z_2$) under the permutation $\pi_j$. Define the link bit variables $x_1$, $x_2$ obtained by linking $n$ b-bit values:
$$x_1 = e_{1,1,\pi_1} e_{1,2,\pi_1} \cdots e_{1,b,\pi_1} \; \cdots \; e_{1,1,\pi_n} e_{1,2,\pi_n} \cdots e_{1,b,\pi_n}, \quad x_2 = e_{2,1,\pi_1} e_{2,2,\pi_1} \cdots e_{2,b,\pi_1} \; \cdots \; e_{2,1,\pi_n} e_{2,2,\pi_n} \cdots e_{2,b,\pi_n}$$
$x_1 = x_2$ holds only when $e_{1,i,\pi_j} = e_{2,i,\pi_j}$ for all $i \in [1, b]$, $j \in [1, n]$.
Let $G_{b,n}$ denote the probability that $x_1 = x_2$, where $b$ is the number of bits and $n$ is the number of linked values. Then:
$$G_{b,n} = E_b^{\,n}$$
The similarity of document 1 and document 2 is then estimated as:
$$\hat{R}_{b,n} = \frac{\hat{G}_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}}$$
where $\hat{G}_{b,n}$ is the observed fraction of linked groups for which $x_1 = x_2$, and $C_{1,b}$ and $C_{2,b}$ are the correction terms of the b-bit minwise estimator.
5. A link bit similarity measurement algorithm, characterized in that it comprises:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$;
Step 2, link bit fingerprint generation: this step generates a link bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link bit similarity measurement: this step compares the link bit fingerprint similarity of two documents.
CN2012101353393A 2012-05-04 2012-05-04 Method for searching similar texts and link bit similarity measuring algorithm Pending CN102682104A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012101353393A CN102682104A (en) 2012-05-04 2012-05-04 Method for searching similar texts and link bit similarity measuring algorithm

Publications (1)

Publication Number Publication Date
CN102682104A true CN102682104A (en) 2012-09-19

Family

ID=46814029

Country Status (1)

Country Link
CN (1) CN102682104A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315622A (en) * 2007-05-30 2008-12-03 香港中文大学 System and method for detecting file similarity

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李旭等: "一种基于提取指纹方法的数字文档拷贝检测模型", 《计算机科学》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102937994A (en) * 2012-11-15 2013-02-20 北京锐安科技有限公司 Similar document query method based on stop words
CN104063502B (en) * 2014-07-08 2017-03-22 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
CN104063502A (en) * 2014-07-08 2014-09-24 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
CN104636325A (en) * 2015-02-06 2015-05-20 中南大学 Document similarity determining method based on maximum likelihood estimation
CN104636325B (en) * 2015-02-06 2015-09-30 中南大学 A kind of method based on Maximum-likelihood estimation determination Documents Similarity
CN104750844A (en) * 2015-04-09 2015-07-01 中南大学 Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts
CN105718430A (en) * 2016-01-13 2016-06-29 湖南工业大学 Grouping minimum value-based method for calculating fingerprint similarity
CN105718430B (en) * 2016-01-13 2018-05-04 湖南工业大学 A kind of method for calculating similarity as fingerprint based on packet minimum value
CN106951765A (en) * 2017-03-31 2017-07-14 福建北卡科技有限公司 A kind of zero authority mobile device recognition methods based on browser fingerprint similarity
CN108829660A (en) * 2018-05-09 2018-11-16 电子科技大学 A kind of short text signature generating method based on random number division and recursion
CN108829660B (en) * 2018-05-09 2021-08-31 电子科技大学 Short text signature generation method based on random number division and recursion
CN113011194A (en) * 2021-04-15 2021-06-22 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN113011194B (en) * 2021-04-15 2022-05-03 电子科技大学 Text similarity calculation method fusing keyword features and multi-granularity semantic features
CN115344846A (en) * 2022-07-29 2022-11-15 贵州电网有限责任公司 Fingerprint retrieval model and verification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120919