CN103646080A - Microblog duplication-eliminating method and system based on reverse-order index - Google Patents

Info

Publication number
CN103646080A
Authority
CN
China
Prior art keywords
segmentation
order index
hamming
inverted order
signature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310681714.9A
Other languages
Chinese (zh)
Inventor
王鑫文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201310681714.9A
Publication of CN103646080A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a microblog duplication-eliminating method and system based on reverse-order index. The method comprises the steps as follows: a text is subjected to word segmentation by a model training module according to lexicon data; the text is subjected to word frequency statistics by a simhash module according to a result after the word segmentation and is converted into an N-dimensional vector, and simhash calculation is performed on the N-dimensional vector, so that an f-bit binary signature is obtained; a duplication-eliminating calculation module executes the following operation: the f-bit binary signature is segmented according to set parameters, and the reverse-order index is established according to a segmentation result; signature collection of first segmentation is searched segmentally according to the established reverse-order index, and a corresponding hamming distance in the first segmentation is calculated; and whether the calculated hamming distance in the first segmentation is in the set parameter range is determined.

Description

Microblog deduplication method and system based on an inverted index
Technical Field
The present invention relates to the field of microblog-based information analysis, and in particular to a microblog deduplication method and system based on an inverted index.
Background
With the development of the Internet, microblogs are becoming a main channel for spreading information and for ordinary consumers to report problems and lodge complaints. For an enterprise, promptly and proactively handling the problems raised on microblogs and preventing the wide spread of negative information is a main task of its customer service and public relations departments, and directly affects the enterprise's brand image and commercial value. The timeliness and validity of the large volume of microblog posts captured by an information analysis system directly affect the efficiency and promptness with which those departments can respond.
To avoid duplicate content, duplicate judgment is required in order to reduce storage, improve computational efficiency and improve the user experience. For judging duplicates of microblog text, existing technical solutions mainly use methods such as string edit distance, cosine similarity, and simhash deduplication.
The string edit distance method is based on the minimum number of edit operations needed to convert one of two strings into the other. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. For example, let string A be x1x2x3x4x5 and string B be y1y2y3y4y5. If converting B into A requires M edit operations, the similarity is 1 - M/N, where N is the string length; the closer the similarity is to 1, the more similar the strings.
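The edit-distance similarity described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and taking N as the length of the longer string is an assumption, since the text only defines N for strings of equal length.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning b into a."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity 1 - M/N; N is assumed to be the longer string's length."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    return 1 - edit_distance(a, b) / n
```

Each comparison costs time proportional to the product of the two string lengths, which is one reason pairwise comparison scales poorly on large collections.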
In the cosine similarity method, a dictionary is built and the microblog data is segmented into words according to the words recorded in the dictionary. After segmentation, the number of occurrences of each word is counted. For example, one text contains the words Z1c1, Z1c2, Z1c3, Z1c4, ..., Z1cn, whose counts in the text are Z1n1, Z1n2, Z1n3, ..., Z1nm; another text contains the words Z2c1, Z2c2, Z2c3, Z2c4, ..., Z2cn, whose counts in the text are Z2n1, Z2n2, Z2n3, ..., Z2nm. The two texts are thus converted into two vectors, and the similarity between the two vectors can be computed as the cosine of the angle between them (the formula appears in the original as image BDA0000436150440000021):
$$\cos\theta = \frac{\sum_{i=1}^{m} Z_{1n_i}\, Z_{2n_i}}{\sqrt{\sum_{i=1}^{m} Z_{1n_i}^{2}}\;\sqrt{\sum_{i=1}^{m} Z_{2n_i}^{2}}}$$
The closer the result is to 1, the higher the similarity.
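A minimal sketch of the cosine similarity computation on word counts follows. It assumes the texts have already been segmented into word lists; the function name and the choice to return 0.0 for an empty text are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the word-count vectors of two segmented texts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```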
In the simhash deduplication method, the microblog text is segmented into words and converted into an N-dimensional vector whose values are word frequencies; that is, the feature weights are word frequencies. This vector is the input to the simhash computation, and the output is an f-bit signature. The Hamming distance between two signatures is then computed; if it is within the set parameter range, the two texts are judged to be similar. The simhash computation is shown in Fig. 1 and the overall deduplication flow in Fig. 2. The simhash procedure is as follows:
1. Initialize an f-dimensional vector V to 0 and an f-bit binary number S to 0.
2. For each feature, use a conventional hash algorithm to produce an f-bit signature b for that feature. For i = 1 to f:
if the i-th bit of b is 1, add the weight of the feature to the i-th element of V;
otherwise, subtract the weight of the feature from the i-th element of V.
3. If the i-th element of V is greater than 0, set the i-th bit of S to 1; otherwise set it to 0.
4. Output S as the signature.
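The four steps above can be sketched as follows. The text does not name the "conventional hash algorithm", so using MD5 truncated to f bits is an assumption made only for illustration, as are the function names and the bit numbering (bit i is taken from the least significant end).

```python
import hashlib


def feature_hash(feature: str, f: int = 64) -> int:
    """f-bit hash of one feature (assumption: MD5 truncated to f bits, f <= 128)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << f) - 1)


def simhash(weights: dict, f: int = 64) -> int:
    """Compute an f-bit simhash signature from a {feature: weight} mapping."""
    v = [0] * f                          # step 1: V initialised to 0
    for feature, weight in weights.items():
        b = feature_hash(feature, f)     # step 2: f-bit signature of the feature
        for i in range(f):
            if (b >> i) & 1:
                v[i] += weight           # bit is 1: add the weight
            else:
                v[i] -= weight           # bit is 0: subtract the weight
    s = 0
    for i in range(f):                   # step 3: the sign of each element gives a bit of S
        if v[i] > 0:
            s |= 1 << i
    return s                             # step 4: S is the signature
```

With word frequencies as weights, a call such as `simhash({"microblog": 2, "index": 1})` yields one 64-bit integer per text; texts sharing most of their heavily weighted words tend to produce signatures that differ in few bits.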
Shortcomings of the prior art
With a large volume of microblog data, every one of these deduplication methods is inefficient. In particular, when a microblog post is captured into the information analysis system, the system must also determine whether it already holds a similar post (for example, a repost); the amount of computation then becomes excessive and directly affects the timeliness of the microblog data.
All of the above deduplication methods decide duplication by pairwise comparison of texts. On today's Internet the volume of microblog data produced each day is very large, so the computation needed to decide duplication after the information analysis system captures a post is enormous. For example, if the system already holds N posts, then after capturing a new post it needs N comparisons in the worst case to decide whether the post is a duplicate. Such efficiency is too low.
In view of these shortcomings of the prior art, we propose a simhash deduplication method based on an inverted index. The method is an improved algorithm built on simhash and maintains computational efficiency on large data sets. The present invention remedies the inefficiency of deduplication methods on large data, supports effective refinement of microblog data, and improves the promptness with which an enterprise can respond to the spread of microblog information.
Summary of the Invention
According to one embodiment of the present invention, a microblog deduplication method based on an inverted index is provided, the method comprising: segmenting a text into words by a model training module according to dictionary data; performing, by a simhash module, word-frequency statistics on the text according to the word segmentation result so as to convert the text into an N-dimensional vector, and performing a simhash computation on the N-dimensional vector to obtain an f-bit binary signature; and performing, by a deduplication computing module, the following operations: dividing the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result; retrieving, segment by segment according to the built inverted index, the signature set under a first segment, and computing the Hamming distance corresponding to the signature set of the first segment; and determining whether the Hamming distance computed for the first segment is within the set parameter range.
Preferably, the method further comprises: if the computed Hamming distance is not within the set parameter range, deeming the text not to be a duplicate and storing the segments in an inverted index storage module.
Preferably, the method further comprises: if the Hamming distance computed for the first segment is within the set parameter range, retrieving, segment by segment according to the built inverted index, the signature set under a second segment and computing the Hamming distance corresponding to the signature set of the second segment; and determining whether the Hamming distance computed for the second segment is within the set parameter range.
Preferably, the number of segments is greater than the value of the set parameter.
Preferably, the set parameter range is 0-7.
According to another embodiment of the present invention, a microblog deduplication system based on an inverted index is provided, the system comprising: a model training module configured to segment a text into words according to dictionary data; a simhash module configured to perform word-frequency statistics on the text according to the word segmentation result so as to convert the text into an N-dimensional vector, and to perform a simhash computation on the N-dimensional vector to obtain an f-bit binary signature; and a deduplication computing module configured to perform the following operations: dividing the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result; retrieving, segment by segment according to the built inverted index, the signature set under a first segment, and computing the Hamming distance corresponding to the signature set of the first segment; and determining whether the Hamming distance computed for the first segment is within the set parameter range.
Preferably, the deduplication computing module is further configured to: if the computed Hamming distance is not within the set parameter range, deem the text not to be a duplicate and store the segments in an inverted index storage module.
Preferably, the deduplication computing module is further configured to: if the Hamming distance computed for the first segment is within the set parameter range, retrieve, segment by segment according to the built inverted index, the signature set under a second segment and compute the Hamming distance corresponding to the signature set of the second segment; and determine whether the Hamming distance computed for the second segment is within the set parameter range.
Preferably, the number of segments is greater than the value of the set parameter.
Preferably, the set parameter range is 0-7.
The deduplication solution of the present invention can reduce space and time complexity while maintaining the accuracy of the computation. Other objects, features and advantages will be apparent to those skilled in the art from the following detailed description and the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the invention. In the drawings:
Fig. 1 is a schematic diagram of the simhash hashing procedure.
Fig. 2 is a flowchart of simhash deduplication processing.
Fig. 3 is a block diagram of a system for simhash deduplication based on an inverted index according to an embodiment of the present invention.
Fig. 4A is a schematic diagram of an inverted index according to an embodiment of the present invention.
Fig. 4B is a schematic diagram of an example of an inverted index according to an embodiment of the present invention.
Fig. 5 is a flowchart of a simhash microblog deduplication method based on an inverted index according to an embodiment of the present invention.
Detailed Description
The technical solution according to embodiments of the present invention is explained in detail below with reference to the accompanying drawings.
The term "microblog information monitoring system" as used herein refers to a system that rapidly crawls microblog websites by integrating Internet information collection technology and intelligent information processing technology, applies natural language processing techniques such as deduplication, spam filtering and clustering to the data, and produces valuable data, thereby giving clients a comprehensive view of how information is distributed across consumer groups and providing a basis for analysis and for guiding information correctly.
The simhash microblog deduplication method based on an inverted index disclosed by the present invention is a new technical solution that improves on the original simhash deduplication method.
The term "word segmentation" as used herein refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard. To filter Chinese information, the text must first be preprocessed by Chinese word segmentation and expressed as a model that supports computation and inference. Chinese word segmentation divides a sequence of Chinese characters into meaningful words. Word segmentation is part of Chinese information processing; it is not an end in itself but a necessary stage for subsequent processing, and is a basic technique of Chinese information processing. Although the present invention is described using Chinese text as an example, those skilled in the art will understand that the text is not limited to Chinese: the technical solution of the present invention can be applied to any text in a language whose word boundaries must be determined, such as Japanese text or Korean text.
Although various word segmentation algorithms exist, a mature word segmentation system cannot rely on a single algorithm; it must combine different algorithms and, in practice, select different segmentation schemes according to the circumstances. The accuracy of word segmentation affects the quality of retrieval results. The main steps currently taken in Chinese lexical analysis are: first, obtain a reasonably good rough segmentation using methods such as maximum matching, shortest path, probability statistics or full segmentation; then perform disambiguation and out-of-vocabulary word recognition; and finally tag parts of speech. In a real system these three steps may interleave and be merged repeatedly, and may have no clear order.
Although word segmentation accuracy is very important for deduplication, a segmenter that is too slow is unusable in an information analysis system no matter how accurate it is. Because an information analysis system must process hundreds of millions of web pages, overly slow word segmentation would seriously reduce the speed at which the system updates its content. For an information analysis system, therefore, both the accuracy and the speed of word segmentation must meet very high requirements.
The term "word frequency" as used herein refers to the frequency with which each word occurs in a sentence or an article. It is a basic technique of Chinese information processing and has important applications in many fields. Formally, a word is a stable combination of characters; therefore, the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently thus reflects fairly well the likelihood that they form a word. Apart from extremely common words, the words that occur most frequently in an article usually reflect its topic, so word frequency can be used to cluster Chinese articles.
In addition, under normal circumstances, highly similar web page content provides users with little or no new information, yet processing such similar content consumes a large amount of server resources. At the same time, it should be considered that if a web page is duplicated many times, its content is popular and the page is relatively important, so it should be given a higher weight.
The term "Hamming distance" as used herein refers, in information coding, to the number of positions at which two valid codewords differ. The number of bit positions at which two codewords have different values is called the Hamming distance between the two codewords. Within a valid code set, the minimum Hamming distance between any two codewords is called the Hamming distance of the code set. The more bit positions in which the signatures of two documents differ, the larger the Hamming distance. A larger Hamming distance indicates that the two documents are more dissimilar, and a smaller one that they are more similar. Different systems may use different Hamming distance thresholds to judge whether two pieces of web page content are near duplicates. Typically, for 64-bit binary values, a Hamming distance of at most 3 (≤ 3) is used as the criterion for near duplication. For example, counting from the first bit, 10101 and 00110 differ in the first, fourth and fifth bits, so their Hamming distance is 3. If the parameter is set to 3, the two texts can be judged to be duplicates.
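The Hamming distance between two signatures held as integers can be computed with an XOR followed by a bit count. The sketch below reproduces the 10101 / 00110 example; the function names and the default threshold of 3 are illustrative choices taken from the example above.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(x ^ y).count("1")  # XOR leaves a 1 exactly where the bits differ


def is_near_duplicate(x: int, y: int, max_distance: int = 3) -> bool:
    """Two signatures are near duplicates if their distance is within the set parameter."""
    return hamming_distance(x, y) <= max_distance


print(hamming_distance(0b10101, 0b00110))  # 3
```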
The technical solution of the present invention divides the signature obtained from the simhash computation into segments, builds an inverted index, and performs duplicate judgment and Hamming distance calculation one segment at a time. The principle of the invention is similar to the pigeonhole principle: if 5 pigeons are placed in 4 cages, at least one cage must hold at least 2 pigeons. If the parameter is set to 7 in the present invention, this is equivalent to distributing at most 7 differing bits among 8 segments, so at least one segment must be identical in the two signatures. The inverted-index-based deduplication of the present invention relies on this principle.
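The pigeonhole argument can be checked empirically with a short sketch. It assumes 64-bit signatures split into 8 equal segments of 8 bits; the helper names are hypothetical. Flipping any 7 bits of a signature always leaves at least one segment untouched, whereas flipping one bit in each of the 8 segments does not.

```python
import random


def split_segments(sig: int, f: int = 64, n_segments: int = 8) -> list:
    """Split an f-bit signature into n_segments equal parts, most significant first."""
    width = f // n_segments
    mask = (1 << width) - 1
    return [(sig >> (f - width * (i + 1))) & mask for i in range(n_segments)]


def shares_segment(x: int, y: int, f: int = 64, n_segments: int = 8) -> bool:
    """True if the two signatures are identical in at least one segment position."""
    return any(a == b for a, b in zip(split_segments(x, f, n_segments),
                                      split_segments(y, f, n_segments)))


rng = random.Random(0)
for _ in range(1000):
    x = rng.getrandbits(64)
    y = x
    for bit in rng.sample(range(64), 7):  # flip exactly 7 distinct bits
        y ^= 1 << bit
    assert shares_segment(x, y)           # 7 differences cannot touch all 8 segments
```

This is why the number of segments must exceed the set parameter: with 8 or more differing bits, every segment may differ and an exact-match lookup can miss the pair.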
Fig. 3 is a block diagram of a system 300 for simhash deduplication based on an inverted index according to an embodiment of the present invention. As shown in Fig. 3, the system 300 comprises a data management module 301, a model training module 303, a simhash hashing module 305, a deduplication computing module 307 and an inverted index storage module 309.
The data management module 301 performs dictionary management on message content collected from microblogs. The message content includes information such as the microblog text, repost relationships, blogger ID and posting time.
The model training module 303 performs Chinese word segmentation on the microblog content from the dictionary management submodule 302. The simhash hashing module 305 performs vector conversion and the simhash computation. For example, for the text "simhash microblog deduplication method based on an inverted index", the word segmentation result is "based on, inverted, index, simhash, microblog, deduplication, method" and the corresponding weights are (1, 1, 1, 1, 1, 1, 1); this is a 7-dimensional vector.
The deduplication computing module 307 partitions the hash value into blocks, performs segment lookup and calculates Hamming distances. Specifically, the deduplication computing module 307 divides the f-bit binary signature into segments according to the set parameter, where the number of segments is greater than the value of the set parameter, builds an inverted index from the segmentation result, and performs segment-by-segment retrieval and Hamming distance calculation in order to judge duplication. That is, if a duplicate is found, the result "duplicate" is returned; if not, the signature set under the next segment is retrieved from the inverted index, and so on until the last segment.
For example, the deduplication computing module 307 divides the signature to be judged into 8 segments and, using the built inverted index, retrieves the signature set under the first segment, then calculates the Hamming distance to each signature in that set one by one to judge duplication, until all index sets whose segment equals it have been traversed. The deduplication computing module 307 then judges whether the calculated Hamming distance is within the set parameter range. If it is, the text is judged to be a duplicate and that result is returned. If it is not, the text is judged not to be a duplicate with respect to that segment, and the signature set under the second segment is retrieved using the second segment, and so on until the eighth segment.
More specifically, the f-bit signature obtained from the simhash hashing module is first divided into segments, for example 8 segments, and each segment is then mapped to the signature, as shown in Fig. 4A, which is a schematic diagram of an inverted index according to an embodiment of the present invention. The 64-bit binary string "1011011010001111...0101011110011100" is divided into eight segments "10110110", "10001111", ..., "01010111", "10011100". The 64 bits are then rearranged so that any one of the segments serves as the first 8 bits; there are 8 such arrangements in total, which yields 8 mappings (tables). The first 8 bits are then looked up by exact match. Suppose the sample library contains 2^34 hash fingerprints (about 1.7 × 10^10). The signature set corresponding to each segment (that is, each table) then returns about 2^(34-16) = 262144 candidate results (this figure corresponds to an exact match on a 16-bit prefix), which greatly reduces the cost of the Hamming distance calculations.
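One way to picture the segment-to-signature mapping is a dictionary keyed by (segment position, segment value). This is a hedged sketch rather than the patent's storage layout: keying by position replaces the 8 rearranged tables described above with a single dictionary, and all function names are hypothetical.

```python
from collections import defaultdict


def split_segments(sig: int, f: int = 64, n_segments: int = 8) -> list:
    """Split an f-bit signature into n_segments equal parts, most significant first."""
    width = f // n_segments
    mask = (1 << width) - 1
    return [(sig >> (f - width * (i + 1))) & mask for i in range(n_segments)]


def build_index(signatures, f: int = 64, n_segments: int = 8):
    """Map each (position, segment value) key to the set of signatures containing it."""
    index = defaultdict(set)
    for sig in signatures:
        for key in enumerate(split_segments(sig, f, n_segments)):
            index[key].add(sig)
    return index


def candidates(index, sig: int, f: int = 64, n_segments: int = 8) -> set:
    """All stored signatures that match sig exactly in at least one segment."""
    found = set()
    for key in enumerate(split_segments(sig, f, n_segments)):
        found |= index.get(key, set())
    return found
```

Only the signatures returned by `candidates` need a full Hamming distance check, instead of every signature in the library.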
The inverted index storage module 309 stores the signature segments; specifically, it stores the hash values and the inverted index.
For example, Fig. 4B is a schematic diagram of an example of an inverted index according to an embodiment of the present invention. Assume 16-bit signatures and a Hamming distance of at most 3 (≤ 3) as the criterion for near duplication. The system 300 captures the microblog post "Jingdong Double 11, I speak for myself; merchants give 300,000 in discounts; top-quality food at the lowest prices, the most powerful promotion". The model training module 303 and the simhash hashing module 305 of the system 300 then segment this post into words and apply simhash processing, obtaining the 16-bit signature f1: 1010111101010011. An inverted index is then built and stored, giving the structure shown in Fig. 4B. When the system 300 captures the microblog post "Jingdong is simply fast: the computer ordered in the morning was delivered in the afternoon", word segmentation and simhash processing are performed as described above, giving the signature f2: 1101011111001001. For signature f2, the first segment 1101 is obtained first, and the first set of the above inverted index storage structure is retrieved, giving the signature f1. The Hamming distance between f1 and f2 is then calculated; if this Hamming distance is greater than 3, the second set of the inverted index storage structure is retrieved using the second segment 0111, giving another f1, its Hamming distance is calculated, and so on. If a duplicate is found, the result is returned directly; otherwise, an inverted index is built and stored for this signature in the same way as for signature f1.
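The arithmetic of this example can be worked through directly. The sketch below assumes four 4-bit segments, as implied by the segments 1101 and 0111 quoted above, and an exact-match lookup; under that reading the two signatures share no segment and differ in 8 bits, so f2 would be treated as new and stored. The helper name is hypothetical.

```python
f1 = 0b1010111101010011  # signature of the first post
f2 = 0b1101011111001001  # signature of the second post


def segments(sig: int, f: int = 16, n: int = 4) -> list:
    """Split an f-bit signature into n equal bit strings, most significant first."""
    width = f // n
    bits = format(sig, f"0{f}b")
    return [bits[i * width:(i + 1) * width] for i in range(n)]


distance = bin(f1 ^ f2).count("1")
shared = [a for a, b in zip(segments(f1), segments(f2)) if a == b]

print(segments(f1))  # ['1010', '1111', '0101', '0011']
print(segments(f2))  # ['1101', '0111', '1100', '1001']
print(distance)      # 8, which is greater than the threshold 3
print(shared)        # [], no segment matches exactly
```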
Fig. 5 is a flowchart of a simhash microblog deduplication method 500 based on an inverted index according to an embodiment of the present invention. As shown in Fig. 5, the method 500 starts at step S501, in which, when the system captures microblog data, the model training module 303 segments the text into words according to dictionary data. Then, in step S503, the simhash module 305 performs word-frequency statistics on the text according to the word segmentation result and converts the text into an N-dimensional vector. Then, in step S505, the simhash module 305 performs the simhash computation and, in step S507, obtains an f-bit binary signature. Then, in step S509, the deduplication computing module 307 divides the f-bit binary signature into segments according to the set parameter, where the number of segments is greater than the value of the set parameter, and builds an inverted index from the segmentation result, in which the "key" is each segment of the signature and the "value" is the signature itself. In step S511, the deduplication computing module 307 retrieves the signature set under a segment according to the built inverted index and calculates the corresponding Hamming distance, until all index sets whose segment equals it have been traversed. In step S513, it is determined whether the calculated Hamming distance is within the set parameter range. If the calculated Hamming distance is within the set parameter range, the text is deemed a duplicate and no storage operation is needed, and the method returns to step S511 to retrieve the signature set under the next segment according to the built inverted index and calculate the corresponding Hamming distance. If the calculated Hamming distance is not within the set parameter range, the text is deemed not to be a duplicate and, in step S515, the segments are stored in the inverted index storage module.
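The lookup-then-store flow can be sketched as a small class. This is an illustrative reading, not the patented implementation: it follows the description of the deduplication computing module 307, in which a match within the set parameter ends the search and a signature with no match in any segment is stored. The class and method names, and the defaults of 64 bits, 8 segments and a maximum distance of 7, are assumptions drawn from the examples in this description.

```python
from collections import defaultdict


class InvertedIndexDeduplicator:
    """Segment-keyed inverted index over simhash signatures (illustrative sketch)."""

    def __init__(self, f: int = 64, n_segments: int = 8, max_distance: int = 7):
        if f % n_segments != 0:
            raise ValueError("f must be divisible by n_segments")
        if n_segments <= max_distance:
            raise ValueError("the number of segments must exceed the set parameter")
        self.f, self.n_segments, self.max_distance = f, n_segments, max_distance
        self.index = defaultdict(set)  # (position, segment value) -> signatures

    def _keys(self, sig: int):
        width = self.f // self.n_segments
        mask = (1 << width) - 1
        for pos in range(self.n_segments):
            yield pos, (sig >> (self.f - width * (pos + 1))) & mask

    def is_duplicate(self, sig: int) -> bool:
        # Retrieve the signature set under each segment in turn (cf. S511)
        # and compare Hamming distances against the set parameter (cf. S513).
        for key in self._keys(sig):
            for other in self.index.get(key, ()):
                if bin(sig ^ other).count("1") <= self.max_distance:
                    return True
        return False

    def check_and_add(self, sig: int) -> bool:
        """Return True if sig is a duplicate; otherwise store its segments (cf. S515)."""
        if self.is_duplicate(sig):
            return True
        for key in self._keys(sig):
            self.index[key].add(sig)
        return False
```

A caller would compute each captured post's signature and pass it to `check_and_add`, keeping the post only when the method returns False.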
For example, based on a large number of Hamming distance computations on Chinese text, when the simhash result is a 64-bit binary code, texts whose Hamming distance is within the range 0-7 can preferably be regarded as duplicates.
In the technical solution of the present application, performing simhash deduplication using the stored inverted index is the key point of the invention. It overcomes the limits that traditional deduplication methods place on the timeliness and promptness with which an enterprise can process microblog information. More importantly, the Hamming distance parameter chosen for Chinese text when judging duplicates directly affects the accuracy of deduplication and the handling of important information. All of these play a key role in the promptness of an enterprise's information monitoring on microblogs.
The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the invention. It will be apparent to those skilled in the art that various modifications and changes can be made to the embodiments of the present invention without departing from its spirit and scope. The present invention is therefore intended to cover all modifications or variations that fall within the scope of the invention as defined by the claims.

Claims (10)

1. A method for microblog deduplication based on an inverted index, the method comprising:
By a model training module, segmenting text into words according to dictionary data;
By a simhash module, performing word-frequency statistics on the text according to the word segmentation result to convert the text into an N-dimensional vector, and performing a simhash operation on the N-dimensional vector to obtain an f-bit binary signature;
By a deduplication computing module, performing the following operations:
Segmenting the f-bit binary signature according to a set parameter, and building an inverted index from the segmentation result;
Retrieving, according to the built inverted index, the signature set under a first segment, and calculating the corresponding Hamming distance for the first segment; and
Determining whether the Hamming distance calculated for the first segment is within the range of the set parameter.
2. The method according to claim 1, further comprising:
If the calculated Hamming distance is not within the range of the set parameter, determining that the text is not a duplicate and storing the segments in an inverted index storage module.
3. The method according to claim 1 or 2, further comprising:
If the Hamming distance calculated for the first segment is within the range of the set parameter, retrieving, according to the built inverted index, the signature set under a second segment and calculating the corresponding Hamming distance for the second segment; and
Determining whether the Hamming distance calculated for the second segment is within the range of the set parameter.
4. The method according to claim 1, wherein the number of segments is greater than the value of the set parameter.
5. The method according to claim 1, wherein the range of the set parameter is 0-7.
6. A system for microblog deduplication based on an inverted index, the system comprising:
A model training module configured to segment text into words according to dictionary data;
A simhash module configured to perform word-frequency statistics on the text according to the word segmentation result to convert the text into an N-dimensional vector, and to perform a simhash operation on the N-dimensional vector to obtain an f-bit binary signature;
A deduplication computing module configured to perform the following operations:
Segmenting the f-bit binary signature according to a set parameter, and building an inverted index from the segmentation result;
Retrieving, according to the built inverted index, the signature set under a first segment, and calculating the Hamming distance corresponding to the signature set of the first segment; and
Determining whether the Hamming distance calculated for the first segment is within the range of the set parameter.
7. The system according to claim 6, wherein the deduplication computing module is further configured to:
If the calculated Hamming distance is not within the range of the set parameter, determine that the text is not a duplicate and store the segments in an inverted index storage module.
8. The system according to claim 6, wherein the deduplication computing module is further configured to:
If the Hamming distance calculated for the first segment is within the range of the set parameter, retrieve, according to the built inverted index, the signature set under a second segment and calculate the Hamming distance corresponding to the signature set of the second segment; and
Determine whether the Hamming distance calculated for the second segment is within the range of the set parameter.
9. The system according to claim 6, wherein the number of segments is greater than the value of the set parameter.
10. The system according to claim 6, wherein the range of the set parameter is 0-7.
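The segmentation and inverted index lookup recited in the claims can be sketched as follows. This is a hedged illustration, not the claimed procedure itself: the class name `SegmentIndex`, its method names, the equal-width split, and the defaults (`f=64`, `k=3`, four segments) are all assumptions. The sketch also simplifies the segment-by-segment flow by checking candidates from each segment in turn and stopping at the first signature within the threshold. The requirement that the number of segments exceed the parameter value is what makes the lookup complete: if two signatures differ in at most k bits and are split into more than k segments, at least one segment must be identical, so every near-duplicate shares a bucket with the query.

```python
from collections import defaultdict


class SegmentIndex:
    """Inverted index from (segment position, segment value) to signatures.

    Hypothetical helper; names and defaults are assumptions for illustration.
    """

    def __init__(self, f=64, k=3, segments=4):
        if segments <= k:
            raise ValueError("number of segments must exceed the threshold k")
        if f % segments != 0:
            raise ValueError("this sketch assumes equal-width segments")
        self.f, self.k, self.segments = f, k, segments
        self.width = f // segments
        self.buckets = defaultdict(set)

    def _split(self, signature):
        # Cut the f-bit signature into (position, value) keys, low bits first.
        mask = (1 << self.width) - 1
        return [(i, (signature >> (i * self.width)) & mask)
                for i in range(self.segments)]

    def find_duplicate(self, signature):
        """Return a stored signature within Hamming distance k, or None."""
        for key in self._split(signature):
            for candidate in self.buckets.get(key, ()):
                if bin(signature ^ candidate).count("1") <= self.k:
                    return candidate
        return None

    def add_if_new(self, signature):
        """Index the signature unless a near-duplicate is already stored."""
        if self.find_duplicate(signature) is not None:
            return False
        for key in self._split(signature):
            self.buckets[key].add(signature)
        return True
```

In use, each incoming microblog signature would be passed to `add_if_new`; a return value of `False` marks the text as a duplicate, while `True` means its segments were stored in the index.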

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310681714.9A CN103646080A (en) 2013-12-12 2013-12-12 Microblog duplication-eliminating method and system based on reverse-order index

Publications (1)

Publication Number Publication Date
CN103646080A true CN103646080A (en) 2014-03-19

Family

ID=50251294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310681714.9A Pending CN103646080A (en) 2013-12-12 2013-12-12 Microblog duplication-eliminating method and system based on reverse-order index

Country Status (1)

Country Link
CN (1) CN103646080A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020196976A1 (en) * 2001-04-24 2002-12-26 Mihcak M. Kivanc Robust recognizer of perceptually similar content
CN101887457A (en) * 2010-07-02 2010-11-17 杭州电子科技大学 Content-based copy image detection method
CN103324650A (en) * 2012-10-23 2013-09-25 深圳市宜搜科技发展有限公司 Image retrieval method and system

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335422A (en) * 2014-08-06 2016-02-17 阿里巴巴集团控股有限公司 Public opinion information warning method and apparatus
CN105335422B (en) * 2014-08-06 2019-02-22 阿里巴巴集团控股有限公司 The alarm method and device of public feelings information
CN104407982B (en) * 2014-11-19 2018-09-21 湖南国科微电子股份有限公司 A kind of SSD discs rubbish recovering method
CN104407982A (en) * 2014-11-19 2015-03-11 湖南国科微电子有限公司 SSD (solid state drive) disk garbage recycling method
CN104615714A (en) * 2015-02-05 2015-05-13 北京中搜网络技术股份有限公司 Blog duplicate removal method based on text similarities and microblog channel features
WO2016180268A1 (en) * 2015-05-13 2016-11-17 阿里巴巴集团控股有限公司 Text aggregate method and device
CN106294350A (en) * 2015-05-13 2017-01-04 阿里巴巴集团控股有限公司 A kind of text polymerization and device
CN106294350B (en) * 2015-05-13 2019-10-11 阿里巴巴集团控股有限公司 A kind of text polymerization and device
CN107122370A (en) * 2016-02-25 2017-09-01 阿里巴巴集团控股有限公司 A kind of distributed search method and device
CN105681046A (en) * 2016-02-29 2016-06-15 郑州悉知信息科技股份有限公司 UGC fingerprint signature determination method and device as well as UGC deduplication method and device
CN106126235A (en) * 2016-06-24 2016-11-16 中国科学院信息工程研究所 A kind of multiplexing code library construction method, the quick source tracing method of multiplexing code and system
CN106126235B (en) * 2016-06-24 2019-05-07 中国科学院信息工程研究所 A kind of multiplexing code base construction method, the quick source tracing method of multiplexing code and system
CN106372105A (en) * 2016-08-19 2017-02-01 中国科学院信息工程研究所 Spark platform-based microblog data preprocessing method
CN106469097A (en) * 2016-09-02 2017-03-01 北京百度网讯科技有限公司 A kind of method and apparatus recalling error correction candidate based on artificial intelligence
CN106469097B (en) * 2016-09-02 2019-08-27 北京百度网讯科技有限公司 A kind of method and apparatus for recalling error correction candidate based on artificial intelligence
CN106649273A (en) * 2016-12-26 2017-05-10 东软集团股份有限公司 Text processing method and text processing device
CN106649273B (en) * 2016-12-26 2020-03-17 东软集团股份有限公司 Text processing method and device
CN107229694A (en) * 2017-05-22 2017-10-03 北京红马传媒文化发展有限公司 A kind of data message consistency processing method, system and device based on big data
CN107894979B (en) * 2017-11-21 2021-09-17 北京百度网讯科技有限公司 Compound word processing method, device and equipment for semantic mining
CN107894979A (en) * 2017-11-21 2018-04-10 北京百度网讯科技有限公司 The compound process method, apparatus and its equipment excavated for semanteme
CN108319648A (en) * 2017-12-27 2018-07-24 深圳市三宝创新智能有限公司 A kind of question and answer Data clean system and method based on improvement simhash algorithms
CN109670153A (en) * 2018-12-21 2019-04-23 北京城市网邻信息技术有限公司 A kind of determination method, apparatus, storage medium and the terminal of similar model
CN109670153B (en) * 2018-12-21 2023-11-17 北京城市网邻信息技术有限公司 Method and device for determining similar posts, storage medium and terminal
CN111859063A (en) * 2019-04-30 2020-10-30 北京智慧星光信息技术有限公司 Control method and device for monitoring transfer of seal information in Internet
CN111859063B (en) * 2019-04-30 2023-11-03 北京智慧星光信息技术有限公司 Control method and device for monitoring transfer seal information in Internet
CN110134803B (en) * 2019-05-17 2020-12-11 哈尔滨工程大学 Image data quick retrieval method based on Hash learning
CN110134803A (en) * 2019-05-17 2019-08-16 哈尔滨工程大学 Image data method for quickly retrieving based on Hash study
CN110737748B (en) * 2019-09-27 2023-08-08 成都数联铭品科技有限公司 Text deduplication method and system
CN110737748A (en) * 2019-09-27 2020-01-31 成都数联铭品科技有限公司 text duplicate removal method and system
CN113129056A (en) * 2021-04-15 2021-07-16 微梦创科网络科技(中国)有限公司 Method and system for controlling advertisement putting frequency
CN113434710A (en) * 2021-06-29 2021-09-24 平安普惠企业管理有限公司 Document retrieval method, document retrieval device, server and storage medium
CN113821599A (en) * 2021-09-15 2021-12-21 北京沃东天骏信息技术有限公司 Semantic fingerprint query method, device, equipment and storage medium
CN114281989A (en) * 2021-12-06 2022-04-05 重庆邮电大学 Data deduplication method and device based on text similarity, storage medium and server
CN114943021A (en) * 2022-07-20 2022-08-26 之江实验室 TB-level incremental data screening method and device
US11789639B1 (en) 2022-07-20 2023-10-17 Zhejiang Lab Method and apparatus for screening TB-scale incremental data

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140319