CN102682104A - Method for searching similar texts and link bit similarity measuring algorithm - Google Patents
- Publication number
- CN102682104A
- Authority
- CN
- China
- Prior art keywords
- fingerprint
- similarity
- text
- document
- connect
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Collating Specific Patterns (AREA)
Abstract
The invention discloses a method for searching similar texts, comprising the following steps: step 1, text feature extraction, which extracts the text feature set $S_{shgs}$; step 2, link-bit fingerprint generation, which generates a link-bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$; step 3, link-bit similarity measurement, which compares the link-bit fingerprints of two texts; and step 4, retrieval of the required text according to the link-bit similarity result. The invention also discloses a corresponding link-bit similarity measurement algorithm. Experimental data show that, at the cost of a slight loss of precision, the algorithm reduces the number of comparisons by a multiple and improves performance.
Description
Technical field
The present invention relates to the field of information retrieval, and in particular to a method for estimating similarity. It can be applied to similarity measurement among massive document collections and is especially suitable for quickly finding similar text in massive amounts of information.
Technical background
The rapid development of Internet technology has made the amount of data on the network grow exponentially, so finding useful information quickly in this mass of data has become increasingly important. The concept of text similarity measurement and its related techniques arose in response. A good text similarity measure is of great significance in research fields such as automatic question answering, intelligent retrieval, web page deduplication and natural language processing.
Text similarity is a metric of the degree of matching between two or more texts: the higher the similarity, the more alike the texts are, and vice versa. The traditional text similarity measure is the vector space model (VSM), which obtains the similarity of two documents by computing the inner product of the weighted term-frequency vectors of the query document and a document in the data set. This algorithm has to store a large feature vocabulary, compares slowly and has low accuracy, so it cannot be applied to similarity measurement over massive data. The minwise similarity measurement algorithm converts the similarity problem into the probability of an event: it maps the set of text feature words to a set of hash values, turning string comparison into numeric comparison. It is suitable for similarity measurement over massive data, but it has to compare a large number of fingerprints and takes up a large amount of storage. In 2010, Ping Li et al. improved on the minwise similarity measurement algorithm and proposed the b-bit minwise similarity measurement algorithm, which estimates the similarity of two documents using fewer bits (only b bits of each fingerprint); however, it still has to compare a large number of fingerprints.
Summary of the invention
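The present invention proposes a new method for searching similar texts that overcomes the aforementioned deficiencies of the prior art.
The method according to the invention comprises the following steps:
Step 1, text feature extraction: extract the text feature set $S_{shgs}$;
Step 2, link-bit fingerprint generation: generate a link-bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link-bit similarity measurement: compare the link-bit fingerprint similarity of two documents;
Step 4, use the resulting link-bit fingerprint similarity to obtain the required text.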
Description of drawings
Fig. 1 is the main flow chart of the method according to the invention.
Fig. 2 shows the relationship between link-bit similarity and variance according to an embodiment of the invention.
Fig. 3 shows the experimental accuracy and recall of the link-bit method on the XX data set according to an embodiment of the invention.
Fig. 4 compares the actual efficiency on the XX data set according to an embodiment of the invention.
Embodiment
The method provided by the invention is described in detail below with reference to the accompanying drawings, and its advantages are illustrated with examples and experimental data. Experiments show that, at the cost of a very small loss of precision, the method reduces the number of comparisons by a multiple and improves search performance.
The method for searching similar texts proposed by the invention comprises the following steps:
Step 1, text feature extraction: extract the text feature set $S_{shgs}$;
Step 2, link-bit fingerprint generation: generate a link-bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link-bit similarity measurement: compare the link-bit fingerprint similarity of two documents;
Step 4, use the resulting link-bit fingerprint similarity to obtain the required text.
Preferably, step 1 specifically comprises:
First, the text is scanned and analyzed, and a Chinese word segmentation algorithm splits the document into a set of words. Then a stop-word list is built and used to filter the noise out of the text; the word set that remains is the document's feature set $S_{shgs}$. Noise here means words that carry no meaning in the text, generally high-frequency, low-content words such as auxiliary words and function words.
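Step 1 can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patent's implementation: the regular-expression tokenizer stands in for a Chinese word segmentation algorithm, and `STOP_WORDS` and `extract_features` are hypothetical names with a made-up stop-word list.

```python
import re

# Hypothetical stop-word list for illustration; the text does not publish its own.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}


def extract_features(text, stop_words=STOP_WORDS):
    """Return the feature set of a document (the role played by S_shgs)."""
    # Stand-in for word segmentation: lower-case the text and split on non-word characters.
    tokens = re.findall(r"\w+", text.lower())
    # Drop the noise words; what remains is the document's feature set.
    return {tok for tok in tokens if tok not in stop_words}
```

For Chinese input, the tokenizer line would be replaced by a real segmenter; the stop-word filtering step stays the same.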
Preferably, step 2 specifically comprises:
1) Form the minwise fingerprint
The document feature set $S_{shgs}$ generated in step 1 is mapped to 32-bit integers with the Rabin function; the mapped set is named $S_d$. Let the universe be $\Omega = \{0, 1, \ldots, D-1\}$ and let $a_0 a_1 \ldots a_{D-1}$ be an arrangement of $\Omega$; the vector $(a_0, a_1, \ldots, a_{D-1})$ then represents a permutation $\pi$ of $\Omega$ with $\pi(i) = a_i$.
If, for every data set $X \subseteq \Omega$ and every $x \in X$, a random permutation $\pi$ satisfies
$$\Pr\left(\min\{\pi(X)\} = \pi(x)\right) = \frac{1}{|X|},$$
then $\pi$ is a minwise random permutation. In other words, under $\pi$ every element $x$ of $X$ has the same probability of becoming the minimum after permutation. With $k$ independent random permutations $\pi_1, \pi_2, \ldots, \pi_k$, the set $S_d$ is converted into the minwise fingerprint
$$\{\min \pi_1(S_d),\ \min \pi_2(S_d),\ \ldots,\ \min \pi_k(S_d)\}.$$
2) form b position minwise fingerprint
Defined function: B (x, b)=x&&2
B-1, B (x, b) for getting the b bit function, the b in the function for the figure place that will get.
gets the b position for each element in
, forms b position minwise characteristic fingerprint:
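The b-bit step is a bit mask applied to every minimum. A minimal sketch, with the hypothetical helper names `lowest_bits` and `b_bit_fingerprint`:

```python
def lowest_bits(x, b):
    """Keep the lowest b bits of x, i.e. x & (2**b - 1)."""
    return x & ((1 << b) - 1)


def b_bit_fingerprint(minwise_fp, b):
    """Apply the b-bit mask to every value of a minwise fingerprint."""
    return [lowest_bits(z, b) for z in minwise_fp]
```

With `b = 1` this keeps only the parity of each minimum, which is the setting used in the worked example.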
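3) Form the link-bit fingerprint
Every $n$ consecutive values of the b-bit minwise fingerprint are joined into one link-bit value; the resulting link-bit fingerprint is denoted $S_{dn}$.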
The process of step 2 is illustrated below with Example 1. The examples in this application serve only as illustration and do not limit the invention.
1) Form the minwise fingerprint
Applying $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ to $S_1$ gives:
$\pi_1(S_1)=\{3,0,6\}$, $\pi_2(S_1)=\{6,5,2\}$, $\pi_3(S_1)=\{1,7,6\}$, $\pi_4(S_1)=\{1,5,3\}$, $\pi_5(S_1)=\{7,6,4\}$, $\pi_6(S_1)=\{1,5,3\}$;
Taking the minimum of each permuted set, the minwise fingerprint of document 1 is $\{0, 2, 1, 1, 4, 1\}$.
Applying $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \pi_6$ to $S_2$ gives:
$\pi_1(S_2)=\{3,6,4,1\}$, $\pi_2(S_2)=\{6,2,7,4\}$, $\pi_3(S_2)=\{1,6,2,4\}$, $\pi_4(S_2)=\{1,3,4,6\}$, $\pi_5(S_2)=\{7,4,0,1\}$, $\pi_6(S_2)=\{1,3,0,7\}$;
Taking the minimum of each permuted set, the minwise fingerprint of document 2 is $\{1, 2, 1, 1, 0, 0\}$.
Therefore, the minwise sets generated from $S_1$ and $S_2$ under the random permutations $\pi_1, \ldots, \pi_6$ are $\{0, 2, 1, 1, 4, 1\}$ and $\{1, 2, 1, 1, 0, 0\}$ respectively.
2) Form the b-bit minwise fingerprint
Taking the lowest $b = 1$ bit of each value gives the b-bit minwise fingerprints $\{0, 0, 1, 1, 0, 1\}$ for document 1 and $\{1, 0, 1, 1, 0, 0\}$ for document 2.
3) Form the link-bit fingerprint
Joining every $n = 2$ b-bit values of document 1: $S_{1n} = \{0\text{-}0, 1\text{-}1, 0\text{-}1\} = \{00, 11, 01\}$
Joining every $n = 2$ b-bit values of document 2: $S_{2n} = \{1\text{-}0, 1\text{-}1, 0\text{-}0\} = \{10, 11, 00\}$
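Example 1 can be replayed in a few lines of Python. The permuted sets are copied from the example; the function name `link_bit_fingerprint` and the choice to represent each link-bit value as a bit string are assumptions made for illustration.

```python
# Images of S1 and S2 under pi_1 .. pi_6, as listed in Example 1.
PERMUTED_S1 = [{3, 0, 6}, {6, 5, 2}, {1, 7, 6}, {1, 5, 3}, {7, 6, 4}, {1, 5, 3}]
PERMUTED_S2 = [{3, 6, 4, 1}, {6, 2, 7, 4}, {1, 6, 2, 4},
               {1, 3, 4, 6}, {7, 4, 0, 1}, {1, 3, 0, 7}]


def link_bit_fingerprint(permuted_sets, b, n):
    """Minima -> lowest b bits -> join every n consecutive values into one string."""
    minima = [min(s) for s in permuted_sets]
    bits = [z & ((1 << b) - 1) for z in minima]
    # A trailing group with fewer than n values is dropped.
    return ["".join(format(v, f"0{b}b") for v in bits[i:i + n])
            for i in range(0, len(bits) - n + 1, n)]


s1n = link_bit_fingerprint(PERMUTED_S1, b=1, n=2)
s2n = link_bit_fingerprint(PERMUTED_S2, b=1, n=2)
```

Here `s1n` and `s2n` reproduce the two link-bit fingerprints stated at the end of the example.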
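Preferably, step 3 specifically comprises:
1) Minwise similarity estimate
In the minwise similarity measurement algorithm, the unbiased estimate $\hat{R}_M$ of the similarity $R$ of two documents is the fraction of the $k$ permutations under which the two minima coincide:
$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\{\min \pi_j(S_1) = \min \pi_j(S_2)\}$$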
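The minwise estimate is simply the fraction of fingerprint positions at which the two documents agree. A minimal sketch, with the hypothetical function name `minwise_similarity`:

```python
def minwise_similarity(fp1, fp2):
    """Fraction of positions where two equal-length minwise fingerprints agree."""
    if len(fp1) != len(fp2) or not fp1:
        raise ValueError("fingerprints must be non-empty and of equal length")
    return sum(z1 == z2 for z1, z2 in zip(fp1, fp2)) / len(fp1)
```

Both fingerprints must have been produced with the same permutations in the same order, otherwise position-wise comparison is meaningless.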
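2) b-bit minwise similarity estimate
Define $z_1$ and $z_2$ as the minima of the sets $S_1$ and $S_2$ under a random permutation $\pi$:
$$z_1 = \min\{\pi(S_1)\}, \quad z_2 = \min\{\pi(S_2)\}$$
Let $e_{1,i}$ be the $i$-th lowest bit of $z_1$ and $e_{2,i}$ the $i$-th lowest bit of $z_2$, and let $E_b$ denote the probability that the lowest $b$ bits of $z_1$ and $z_2$ all agree:
$$E_b = \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right) = C_{1,b} + (1 - C_{2,b})R$$
In the b-bit minwise similarity estimate, the unbiased estimate of the similarity of the two documents is:
$$\hat{R}_b = \frac{\hat{E}_b - C_{1,b}}{1 - C_{2,b}}$$
where $\hat{E}_b$ is the fraction of the $k$ permutations under which the lowest $b$ bits of $z_1$ and $z_2$ all agree, and $C_{1,b}$ and $C_{2,b}$ are the constants of the b-bit minwise algorithm, determined by $b$ and by the relative set sizes $r_1 = |S_1|/D$ and $r_2 = |S_2|/D$.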
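The b-bit estimate can be sketched as follows. The closed forms used for $C_{1,b}$ and $C_{2,b}$ are an assumption: they are taken from the published b-bit minwise hashing analysis of Li and König rather than from this text, and the function names are hypothetical.

```python
def b_bit_constants(r1, r2, b):
    """Return (C1, C2) for relative set sizes r1, r2 in (0, 1) and b bits.

    Assumed forms (Li and Konig, b-bit minwise hashing):
      A(r) = r * (1 - r)**(2**b - 1) / (1 - (1 - r)**(2**b))
      C1 = A(r1) * r2/(r1+r2) + A(r2) * r1/(r1+r2)
      C2 = A(r1) * r1/(r1+r2) + A(r2) * r2/(r1+r2)
    """
    def a_term(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

    a1, a2 = a_term(r1), a_term(r2)
    c1 = a1 * r2 / (r1 + r2) + a2 * r1 / (r1 + r2)
    c2 = a1 * r1 / (r1 + r2) + a2 * r2 / (r1 + r2)
    return c1, c2


def b_bit_similarity(bits1, bits2, r1, r2, b):
    """Bias-corrected similarity estimate from two b-bit fingerprints."""
    e_hat = sum(x == y for x, y in zip(bits1, bits2)) / len(bits1)
    c1, c2 = b_bit_constants(r1, r2, b)
    return (e_hat - c1) / (1 - c2)
```

For very sparse sets (both $r_1$ and $r_2$ close to 0) the constants approach $1/2^b$, which is the chance that $b$ random bits agree by accident; the correction removes exactly that accidental agreement.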
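3) Link-bit minwise similarity estimate
Let $e_{1,i,\pi_j}$ ($e_{2,i,\pi_j}$) denote the $i$-th lowest bit of $z_1$ ($z_2$) under $\pi_j$. With $b$ bits per fingerprint, define the link-bit variables $x_1$ and $x_2$ obtained by joining $n$ fingerprints of document 1 and of document 2 respectively.
$x_1 = x_2$ holds only when all $b$ bits agree under each of the $n$ joined permutations, i.e. $e_{1,i,\pi_j} = e_{2,i,\pi_j}$ for $i = 1, \ldots, b$ and $j = 1, \ldots, n$.
Let $G_{b,n}$ denote the probability that $x_1 = x_2$, where $b$ is the number of bits and $n$ is the number of joined fingerprints. Then:
$$G_{b,n} = E_b^{\,n},$$
The estimate of the similarity of document 1 and document 2 is then:
$$\hat{R}_{b,n} = \frac{\hat{G}_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}}$$
where $\hat{G}_{b,n}$ is the observed fraction of link-bit positions at which $x_1 = x_2$, and $C_{1,b}$ and $C_{2,b}$ are the constants of the b-bit estimate in 2).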
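Given the relation $G_{b,n} = E_b^{\,n}$, a link-bit estimate can be obtained by taking the $n$-th root of the observed match fraction and then applying the b-bit bias correction. The sketch below follows that reading; the function name is hypothetical and the constants `c1` and `c2` are passed in rather than derived, so no particular closed form for them is assumed.

```python
def link_bit_similarity(links1, links2, n, c1, c2):
    """Estimate similarity from two link-bit fingerprints.

    links1, links2: equal-length sequences of link-bit values.
    n: number of b-bit fingerprints joined into each link-bit value.
    c1, c2: the b-bit constants C_{1,b} and C_{2,b}.
    """
    g_hat = sum(x == y for x, y in zip(links1, links2)) / len(links1)
    e_hat = g_hat ** (1.0 / n)  # invert G = E**n
    return (e_hat - c1) / (1 - c2)
```

Only `len(links1)` comparisons are made, which is $k/n$ for a fingerprint built from $k$ permutations; that is the source of the reduction in comparison count. With very few link-bit values the estimate is noisy and can even fall below zero.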
The implementation of step 3 is illustrated below with Example 2.
1) Minwise similarity estimate
2) b-bit minwise similarity estimate
3) Link-bit fingerprint similarity estimate, with $b = 1$ and $n = 2$
4) Jaccard similarity
The estimate differs from the actual value because $k$ is too small. As the variance curves in Fig. 2 show, the variance is large when $k$ is small; as $k$ grows, the estimate comes closer and closer to the actual value $R$ and becomes more accurate.
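The gap between estimate and actual value can be checked numerically on the data of Example 1. The sketch below is an illustration, not the patent's own calculation: it computes the exact Jaccard similarity from the images of the two sets under $\pi_1$ (a permutation is a bijection, so the overlap of the images equals the overlap of the original sets) and compares it with the raw match fractions for $k = 6$. Variable names are hypothetical, and the minima are those of the permuted sets listed in Example 1.

```python
from fractions import Fraction

# Images of S1 and S2 under pi_1, from Example 1.
PI1_S1 = {3, 0, 6}
PI1_S2 = {3, 6, 4, 1}

# Minima of the six permuted sets of each document, and the link-bit fingerprints.
MIN1 = [0, 2, 1, 1, 4, 1]
MIN2 = [1, 2, 1, 1, 0, 0]
LINK1 = ["00", "11", "01"]
LINK2 = ["10", "11", "00"]


def match_fraction(a, b):
    """Exact fraction of positions at which two equal-length sequences agree."""
    return Fraction(sum(x == y for x, y in zip(a, b)), len(a))


jaccard = Fraction(len(PI1_S1 & PI1_S2), len(PI1_S1 | PI1_S2))
minwise_estimate = match_fraction(MIN1, MIN2)
b1_match = match_fraction([z & 1 for z in MIN1], [z & 1 for z in MIN2])
link_match = match_fraction(LINK1, LINK2)
```

The exact similarity is 2/5 while the minwise estimate from six permutations is 1/2. The b-bit and link-bit match fractions (2/3 and 1/3) are raw agreement rates that still need the bias correction before they can be read as similarity estimates.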
Advantage of the invention over the prior art: compared with the existing b-bit minwise similarity measurement algorithm, the invention reduces the number of comparisons by a multiple and thereby improves performance by a multiple. This advantage is demonstrated below from three aspects:
1) Variance analysis
By giving up a very small amount of precision, the invention gains a multiple-fold improvement in performance, which is of strong practical significance. Fig. 2 shows, for $k = 1000$ and four selected values of $r_1 = r_2$ (from $10^{-10}$ to 0.9), the relationship between similarity ($R$) and variance (Var) for $b = 1$, $b = 2$ and, with $n = 2$, the link-bit estimators $R_{1,2}$ and $R_{2,2}$. The variance of the link-bit estimator $R_{2,2}$ is larger than that of the $b = 2$ estimator, so precision drops somewhat; but because two 2-bit fingerprints are joined, the number of comparisons is halved. This trade-off matters most in similarity detection over massive data, such as web page deduplication, where similarity often has to be estimated for hundreds of millions of web pages.
2) Accuracy and recall analysis
Fig. 3 shows the experimental accuracy and recall of the link-bit similarity measurement algorithm for similarity $R \ge R_0$. The recall curves in Fig. 3 are almost indistinguishable, but the accuracy curves differ, so the accuracy results are analyzed from two angles.
First, for $R_0 = 0.5$ and an accuracy of 0.8, the five estimators compared require $k = 100, 500, 700, 300, 450$ samples respectively. Take a b-bit estimator and a link-bit estimator as an example: to reach the same accuracy, the link-bit estimator needs 700 samples, more than the 500 needed by the b-bit estimator. But because the link-bit estimator joins 2 fingerprints for each comparison, it needs only 700/2 = 350 comparisons, whereas the b-bit estimator needs 500. Admittedly, the b-bit estimator needs 200 fewer samples, so it also needs less storage than the link-bit estimator.
Second, for $R_0 = 0.5$ and $k = 600$, the accuracies of the five estimators are 0.9, 0.88, 0.84, 0.86 and 0.79 respectively. Taking a b-bit estimator and a link-bit estimator as the example again: with the same sample count $k = 600$, the b-bit estimator reaches an accuracy of 0.88 and the link-bit estimator 0.86, so the link-bit estimator is slightly less accurate, but the gap is small. The link-bit estimator needs only 600/2 = 300 comparisons, whereas the b-bit estimator needs 600. Because both use the same sample count $k = 600$, their storage is the same.
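The comparison counts quoted above follow from dividing the sample count by the number of joined fingerprints. A small sketch of that arithmetic, with hypothetical helper names and the simplifying assumption that storage is $k \times b$ bits per document:

```python
def comparison_count(k, n=1):
    """Comparisons needed for k samples when n fingerprints are joined (n=1: plain b-bit)."""
    return k // n


def storage_bits(k, b):
    """Bits stored per document for k samples of b bits each (simplifying assumption)."""
    return k * b


same_accuracy = (comparison_count(700, 2), comparison_count(500))  # link-bit vs b-bit
same_samples = (comparison_count(600, 2), comparison_count(600))   # link-bit vs b-bit
```

At equal accuracy the link-bit estimator compares less but stores more (700 samples against 500); at equal sample count it compares half as often with identical storage.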
The analysis of accuracy and recall leads to the following conclusion: when $k$ is large, the accuracy of the link-bit minwise similarity measurement algorithm of the invention is very close to that of the b-bit minwise similarity measurement algorithm, so estimating similarity with the link-bit algorithm reduces the number of comparisons and improves efficiency. When $k$ is small, the link-bit algorithm and the b-bit minwise algorithm each have their own strengths in efficiency and storage, and the choice can be made according to system requirements.
3) Efficiency analysis
Claims (5)
1. A method for searching similar texts, characterized by comprising the following steps:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$;
Step 2, link-bit fingerprint generation: this step generates a link-bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link-bit similarity measurement: this step compares the link-bit fingerprint similarity of two documents;
Step 4, the resulting link-bit fingerprint similarity is used to obtain the required text.
2. The method for searching similar texts according to claim 1, characterized in that step 1 specifically comprises:
First, the text is scanned and analyzed, and a Chinese word segmentation algorithm splits the document into a set of words; then a stop-word list is built and used to filter the noise out of the text, and the word set that remains is the document's feature set $S_{shgs}$.
3. The method for searching similar texts according to claims 1-2, characterized in that step 2 specifically comprises:
First, forming the minwise fingerprint; then forming the b-bit minwise fingerprint; and finally forming the link-bit fingerprint.
4. The link-bit similarity measurement algorithm according to claims 1-3, characterized in that step 3 specifically comprises:
Defining $z_1$ and $z_2$ as the minima, under a random permutation $\pi$, of the minwise fingerprint sets $S_1$ and $S_2$ of document 1 and document 2:
$$z_1 = \min\{\pi(S_1)\}, \quad z_2 = \min\{\pi(S_2)\},$$
Letting $e_{1,i,\pi_j}$ ($e_{2,i,\pi_j}$) denote the $i$-th lowest bit of $z_1$ ($z_2$) under $\pi_j$, and, with $b$ bits per fingerprint, defining the link-bit variables $x_1$ and $x_2$ obtained by joining $n$ fingerprints;
Letting $G_{b,n}$ denote the probability that $x_1 = x_2$, where $b$ is the number of bits and $n$ is the number of joined fingerprints, so that:
$$G_{b,n} = E_b^{\,n},$$
The estimate of the similarity of document 1 and document 2 then being:
$$\hat{R}_{b,n} = \frac{\hat{G}_{b,n}^{1/n} - C_{1,b}}{1 - C_{2,b}}$$
where $\hat{G}_{b,n}$ is the observed fraction of link-bit positions at which $x_1 = x_2$, and $C_{1,b}$ and $C_{2,b}$ are the constants of the b-bit minwise similarity estimate.
5. A link-bit similarity measurement algorithm, characterized by comprising:
Step 1, text feature extraction: this step extracts the text feature set $S_{shgs}$;
Step 2, link-bit fingerprint generation: this step generates a link-bit fingerprint, denoted $S_{dn}$, from $S_{shgs}$;
Step 3, link-bit similarity measurement: this step compares the link-bit fingerprint similarity of two documents.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012101353393A CN102682104A (en) | 2012-05-04 | 2012-05-04 | Method for searching similar texts and link bit similarity measuring algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102682104A true CN102682104A (en) | 2012-09-19 |
Family
ID=46814029
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012101353393A Pending CN102682104A (en) | 2012-05-04 | 2012-05-04 | Method for searching similar texts and link bit similarity measuring algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102682104A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101315622A (en) * | 2007-05-30 | 2008-12-03 | 香港中文大学 | System and method for detecting file similarity |
Non-Patent Citations (1)
Title |
---|
李旭等: "一种基于提取指纹方法的数字文档拷贝检测模型", 《计算机科学》 * |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102937994A (en) * | 2012-11-15 | 2013-02-20 | 北京锐安科技有限公司 | Similar document query method based on stop words |
CN104063502B (en) * | 2014-07-08 | 2017-03-22 | 中南大学 | WSDL semi-structured document similarity analyzing and classifying method based on semantic model |
CN104063502A (en) * | 2014-07-08 | 2014-09-24 | 中南大学 | WSDL semi-structured document similarity analyzing and classifying method based on semantic model |
CN104636325A (en) * | 2015-02-06 | 2015-05-20 | 中南大学 | Document similarity determining method based on maximum likelihood estimation |
CN104636325B (en) * | 2015-02-06 | 2015-09-30 | 中南大学 | A kind of method based on Maximum-likelihood estimation determination Documents Similarity |
CN104750844A (en) * | 2015-04-09 | 2015-07-01 | 中南大学 | Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts |
CN105718430A (en) * | 2016-01-13 | 2016-06-29 | 湖南工业大学 | Grouping minimum value-based method for calculating fingerprint similarity |
CN105718430B (en) * | 2016-01-13 | 2018-05-04 | 湖南工业大学 | A kind of method for calculating similarity as fingerprint based on packet minimum value |
CN106951765A (en) * | 2017-03-31 | 2017-07-14 | 福建北卡科技有限公司 | A kind of zero authority mobile device recognition methods based on browser fingerprint similarity |
CN108829660A (en) * | 2018-05-09 | 2018-11-16 | 电子科技大学 | A kind of short text signature generating method based on random number division and recursion |
CN108829660B (en) * | 2018-05-09 | 2021-08-31 | 电子科技大学 | Short text signature generation method based on random number division and recursion |
CN113011194A (en) * | 2021-04-15 | 2021-06-22 | 电子科技大学 | Text similarity calculation method fusing keyword features and multi-granularity semantic features |
CN113011194B (en) * | 2021-04-15 | 2022-05-03 | 电子科技大学 | Text similarity calculation method fusing keyword features and multi-granularity semantic features |
CN115344846A (en) * | 2022-07-29 | 2022-11-15 | 贵州电网有限责任公司 | Fingerprint retrieval model and verification method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120919 |