CN103646080A - Microblog duplication-eliminating method and system based on reverse-order index

Info

Publication number: CN103646080A
Application number: CN201310681714.9A
Authority: CN
Inventors: 王鑫文
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2013-12-12
Filing date: 2013-12-12
Publication date: 2014-03-19

Abstract

The invention relates to a microblog deduplication method and system based on a reverse-order (inverted) index. The method comprises the following steps: a model training module segments a text into words according to lexicon data; a simhash module performs word-frequency statistics on the text according to the word segmentation result, converts the text into an N-dimensional vector, and performs a simhash calculation on the N-dimensional vector to obtain an f-bit binary signature; a deduplication calculation module then performs the following operations: the f-bit binary signature is split into segments according to a set parameter, and the reverse-order index is built from the segmentation result; the signature set under the first segment is retrieved segment by segment according to the built index, and the Hamming distance corresponding to the signature set of the first segment is calculated; and whether the Hamming distance calculated for the first segment is within the set parameter range is determined.

Description

Microblog deduplication method and system based on an inverted index

Technical field

The present invention relates to the field of microblog-based information analysis, and in particular to a microblog deduplication method and system based on an inverted index.

Background technology

With the development of the Internet, microblogs are becoming a main channel for spreading information and for ordinary consumers to report problems and lodge complaints. For an enterprise, actively handling the problems reported on microblogs in a timely manner and stopping the wide spread of negative information is a main task of its customer service and public relations departments, and directly affects the enterprise's brand image and commercial value. The timeliness and validity of the large volume of microblog posts captured by a microblog information analysis system directly affect the processing efficiency and promptness of the customer service and public relations departments.

To avoid duplicate content, duplicate detection is needed in order to reduce storage, improve computational efficiency, and improve the user experience. For duplicate detection of microblog text content, existing technical schemes mainly use methods such as string-comparison edit distance, cosine similarity calculation, and simhash deduplication.

The string-comparison edit distance method is based on the minimum number of edit operations required to convert one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. For example, let string A be x₁x₂x₃x₄x₅ and string B be y₁y₂y₃y₄y₅. If converting B into A requires M edits, the similarity is 1 - M/N, where N is the string length; the closer the similarity is to 1, the more similar the strings are.
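
As a minimal sketch of the edit-distance similarity described above, the following could be used. The function names are illustrative and not part of the disclosure, and taking N as the length of the longer string is an assumption made here so that strings of unequal length are also handled.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning b into a."""
    # prev[i] holds the distance between a[:i] and the prefix of b processed so far.
    prev = list(range(len(a) + 1))
    for j, cb in enumerate(b, start=1):
        curr = [j]
        for i, ca in enumerate(a, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[i] + 1,          # skip one character of b
                            curr[i - 1] + 1,      # skip one character of a
                            prev[i - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Similarity 1 - M/N, with N assumed to be the longer string length."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(a, b) / n
```

Comparing one new string against N stored strings this way requires N full distance computations, which is the cost the background section goes on to criticize.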

In the cosine similarity calculation method, a dictionary is built, the microblog data is segmented into words according to the words recorded in the dictionary, and the number of occurrences of each word is counted after segmentation. For example, one text contains the words $Z_{1c1}, Z_{1c2}, Z_{1c3}, Z_{1c4}, \dots, Z_{1cn}$, whose counts in the text are $Z_{1n1}, Z_{1n2}, Z_{1n3}, \dots, Z_{1nm}$; another text contains the words $Z_{2c1}, Z_{2c2}, Z_{2c3}, Z_{2c4}, \dots, Z_{2cn}$, whose counts in the text are $Z_{2n1}, Z_{2n2}, Z_{2n3}, \dots, Z_{2nm}$. The two texts are thus converted into two vectors, and the similarity between the two vectors can be calculated by the law of cosines. The calculation formula is as follows:

$$\cos\theta = \frac{\sum_{i=1}^{m} Z_{1ni}\, Z_{2ni}}{\sqrt{\sum_{i=1}^{m} Z_{1ni}^{2}}\;\sqrt{\sum_{i=1}^{m} Z_{2ni}^{2}}}$$

The closer the result is to 1, the higher the similarity.
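
The cosine similarity of two word-count vectors could be computed along the following lines. This is only a sketch: it assumes the texts have already been segmented into word lists, and the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b) -> float:
    """Cosine of the angle between the word-count vectors of two segmented texts."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    # A word missing from one text contributes a count of 0 for that text.
    vocabulary = set(counts_a) | set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in vocabulary)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)
```

For instance, the word lists `["a", "b"]` and `["a", "c"]` share one of two words each and give a similarity of 0.5.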

In the simhash deduplication method, the microblog text is segmented into Chinese words and converted into an N-dimensional vector whose values are word frequencies, that is, the feature weights are word frequencies. This vector is the input of the simhash operation, and the output is an f-bit signature. The Hamming distance between two signatures is calculated, and if it falls within the set parameter range, the two texts are judged to be similar. The simhash process is shown in Fig. 1, and the overall deduplication flow is shown in Fig. 2. The simhash process is as follows:

1. Initialize an f-dimensional vector V to 0, and initialize an f-bit binary number S to 0.

2. For each feature, use a conventional hash algorithm to generate an f-bit signature b for the feature. For i = 1 to f:

If bit i of b is 1, add the weight of the feature to the i-th element of V;

otherwise, subtract the weight of the feature from the i-th element of V.

3. If the i-th element of V is greater than 0, bit i of S is 1; otherwise it is 0.

4. Output S as the signature.

Shortcomings of the prior art

With large volumes of microblog data, every one of these deduplication methods is inefficient. In particular, when a microblog post is captured into the information analysis system, the system must also judge whether it already holds a post similar to it (for example, a forwarded post). The amount of computation then becomes excessive and directly affects the timeliness of the microblog data.

The deduplication methods above all decide whether texts are duplicates by pairwise comparison. Given today's Internet and the very large volume of microblog data produced each day, the computation needed to decide whether a post captured by the information analysis system is a duplicate is enormous. Taking N existing microblog posts as an example, after the system captures a new post, in the worst case N comparisons are needed to decide whether it is a duplicate. Such computational efficiency is too low.

In view of the shortcomings of the prior art, we propose a simhash deduplication method based on an inverted index. The method is an improved algorithm based on simhash that maintains computational efficiency on large data sets. The present invention remedies the inefficiency of deduplication methods on large data sets, supports effective refinement of microblog data, and improves an enterprise's promptness in responding to the spread of microblog information.

Summary of the invention

According to one embodiment of the present invention, a microblog deduplication method based on an inverted index is provided, the method comprising: segmenting, by a model training module, a text into words according to dictionary data; performing, by a simhash module, word-frequency statistics on the text according to the word segmentation result to convert it into an N-dimensional vector, and performing a simhash operation on the N-dimensional vector to obtain an f-bit binary signature; and performing, by a deduplication calculation module, the following operations: splitting the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result; retrieving, segment by segment according to the built inverted index, the signature set under the first segment, and calculating the Hamming distance corresponding to the signature set of the first segment; and determining whether the Hamming distance calculated for the first segment is within the set parameter range.

Preferably, the method further comprises: if the calculated Hamming distance is not within the set parameter range, deeming the text not to be a duplicate and storing the segments in an inverted index storage module.

Preferably, the method further comprises: if the Hamming distance calculated for the first segment is within the set parameter range, retrieving, segment by segment according to the built inverted index, the signature set under the second segment and calculating the Hamming distance corresponding to the signature set of the second segment; and determining whether the Hamming distance calculated for the second segment is within the set parameter range.

Preferably, the number of segments is greater than the value of the set parameter.

Preferably, the set parameter range is 0-7.

According to another embodiment of the present invention, a microblog deduplication system based on an inverted index is provided, the system comprising: a model training module configured to segment a text into words according to dictionary data; a simhash module configured to perform word-frequency statistics on the text according to the word segmentation result to convert it into an N-dimensional vector, and to perform a simhash operation on the N-dimensional vector to obtain an f-bit binary signature; and a deduplication calculation module configured to perform the following operations: splitting the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result; retrieving, segment by segment according to the built inverted index, the signature set under the first segment, and calculating the Hamming distance corresponding to the signature set of the first segment; and determining whether the Hamming distance calculated for the first segment is within the set parameter range.

Preferably, the deduplication calculation module is further configured to: if the calculated Hamming distance is not within the set parameter range, deem the text not to be a duplicate and store the segments in an inverted index storage module.

Preferably, the deduplication calculation module is further configured to: if the Hamming distance calculated for the first segment is within the set parameter range, retrieve, segment by segment according to the built inverted index, the signature set under the second segment and calculate the Hamming distance corresponding to the signature set of the second segment; and determine whether the Hamming distance calculated for the second segment is within the set parameter range.

Preferably, the set parameter range is 0-7.

The deduplication technical scheme of the present invention can reduce space and time complexity while preserving the accuracy of the calculation. Other objects, features, and advantages will be apparent to those skilled in the art from the following detailed description and the accompanying drawings.

Brief description of the drawings

The accompanying drawings illustrate embodiments of the present invention and, together with the description, serve to explain the principles of the invention. In the drawings:

Fig. 1 is a schematic diagram of the simhash hashing process.

Fig. 2 is a flowchart of simhash deduplication processing.

Fig. 3 is a block diagram of a system for simhash deduplication based on an inverted index according to an embodiment of the present invention.

Fig. 4A is a schematic diagram of an inverted index according to an embodiment of the present invention.

Fig. 4B is a schematic diagram of an example of an inverted index according to an embodiment of the present invention.

Fig. 5 is a flowchart of a simhash microblog deduplication method based on an inverted index according to an embodiment of the present invention.

Detailed description of the embodiments

The technical scheme according to embodiments of the present invention is explained in detail below with reference to the accompanying drawings.

The term "microblog information monitoring system" as used herein refers to a system that rapidly crawls microblog websites by integrating Internet information collection technology with intelligent information processing technology, and applies natural language processing to deduplicate, spam-filter, and cluster the data into valuable information, so that clients can fully grasp the distribution of information among consumer groups, guide information correctly, and obtain a basis for analysis.

The simhash microblog deduplication method based on an inverted index disclosed by the present invention is a new technical scheme that improves on the original simhash deduplication method.

The term "word segmentation" as used herein refers to the process of recombining a continuous character sequence into a word sequence according to a certain standard. To filter Chinese information, the text must first be preprocessed by Chinese word segmentation and expressed as a model that supports computation and inference. Chinese word segmentation divides a sequence of Chinese characters into meaningful words. Word segmentation is a part of Chinese information processing; it is not an end in itself but a necessary stage for subsequent processing and a basic technology of Chinese information processing. Although Chinese text is used as the example in the present invention, those skilled in the art will understand that the text is not limited to Chinese: the technical scheme of the present invention can be applied to any text in a language whose word boundaries must be determined, such as Japanese text or Korean text.

Although various word segmentation algorithms exist, a mature word segmentation system cannot rely on a single algorithm; it must combine different algorithms and, in actual applications, select different segmentation schemes according to the specific circumstances. The accuracy of word segmentation affects the quality of retrieval results. The main steps currently used in Chinese lexical analysis are: first, obtain a reasonably good coarse segmentation using methods such as maximum matching, shortest path, probability statistics, or full segmentation; then perform disambiguation and unknown-word recognition; and finally tag parts of speech. In an actual system, these three processes may interleave and merge repeatedly, and may have no clear order.

Although segmentation accuracy is very important for deduplication, a segmenter that is too slow is unusable for an information analysis system no matter how accurate it is. Because the information analysis system must process hundreds of millions of web pages, excessive segmentation time would seriously slow the system's content updates. For an information analysis system, therefore, both the accuracy and the speed of word segmentation must meet very high requirements.

The term "word frequency" as used herein refers to the frequency with which each word occurs in a sentence or an article. Word-frequency statistics is a basic technique of Chinese information processing with important applications in many fields. Formally, a word is a stable combination of characters; therefore, the more often adjacent characters occur together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently thus reflects well the confidence that they form a word. Apart from very common words, the words that occur most frequently in an article usually reflect its topic, so Chinese articles can be clustered by word frequency.

In addition, under normal circumstances, highly similar web page content provides the user with little or no new information, yet processing such similar content consumes a large amount of server resources. At the same time, it should be considered that if a web page is duplicated many times, its content is relatively popular and the page is relatively important, so it should be given a higher weight.

Hamming distance, as used herein, refers in information coding to the number of bit positions at which two valid codewords differ. The number of corresponding bits whose values differ between two codewords is called the Hamming distance of the two codewords. Within a set of valid codes, the minimum Hamming distance between any two codewords is called the Hamming distance of the code set. The more bit positions in which the signatures of two documents differ, the larger the Hamming distance; and the larger the Hamming distance, the more dissimilar the two documents are, and vice versa. Different systems may use different Hamming distance thresholds to judge whether two pieces of web content are near-duplicates. Typically, for 64-bit binary values, a Hamming distance less than or equal to 3 (≤ 3) is used as the criterion for near-duplication. For example, counting from the first bit, 10101 and 00110 differ in the first, fourth, and fifth bits, so their Hamming distance is 3. If the parameter is set to 3, the two texts can be judged to be duplicates.
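
The Hamming distance between two signatures held as integers could be computed as in the sketch below. The function names and the default threshold argument are illustrative only.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two signatures differ."""
    # XOR leaves a 1 exactly where the two signatures disagree.
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """True if the Hamming distance is within the set parameter."""
    return hamming_distance(a, b) <= max_distance
```

Applied to the example above, `hamming_distance(0b10101, 0b00110)` evaluates to 3, so the pair counts as a duplicate when the parameter is 3 and not when it is 2.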

The technical scheme of the present invention splits the signature obtained from the simhash operation into segments, builds an inverted index, and performs duplicate detection and Hamming distance calculation segment by segment. The principle of the present invention is similar to the pigeonhole principle: if 5 pigeons are placed in 4 cages, at least one cage must hold at least 2 pigeons. If the parameter is set to 7 in the present invention, this is equivalent to distributing at most 7 differing bits among 8 segments, so at least one segment must be identical in the two signatures. The inverted-index-based deduplication of the present invention relies on this principle.
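
The pigeonhole argument can be checked empirically with a short sketch such as the following. It assumes 64-bit signatures held as integers and split into 8 equal segments; the helper names are hypothetical.

```python
import random


def split_segments(signature: int, f: int = 64, m: int = 8):
    """Split an f-bit signature into m equal segments, most significant first."""
    width = f // m
    mask = (1 << width) - 1
    return [(signature >> (f - width * (i + 1))) & mask for i in range(m)]


def shares_a_segment(a: int, b: int, f: int = 64, m: int = 8) -> bool:
    """True if at least one segment is identical at the same position in a and b."""
    return any(x == y for x, y in zip(split_segments(a, f, m),
                                      split_segments(b, f, m)))


def flip_random_bits(signature: int, count: int, f: int = 64, rng=random) -> int:
    """Return the signature with `count` distinct, randomly chosen bits flipped."""
    for position in rng.sample(range(f), count):
        signature ^= 1 << position
    return signature
```

Flipping at most 7 bits can touch at most 7 of the 8 segments, so `shares_a_segment` always holds for such pairs. Flipping one bit in each of the 8 segments (Hamming distance 8) defeats it, which is why the number of segments must exceed the set parameter.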

Fig. 3 is a block diagram of a system 300 for simhash deduplication based on an inverted index according to an embodiment of the present invention. As shown in Fig. 3, the system 300 comprises a data management module 301, a model training module 303, a simhash hash module 305, a deduplication calculation module 307, and an inverted index storage module 309.

The data management module 301 performs dictionary management on message content collected from microblogs. The message content includes information such as the microblog content, forwarding relationships, blogger ID, and publication time.

The model training module 303 performs Chinese word segmentation on the microblog content from the dictionary management submodule 302. The simhash hash module 305 performs vector conversion and the simhash operation. For example, for "simhash microblog deduplication method based on inverted index", the word segmentation result is "based on, inverted, index, simhash, microblog, deduplication, method" and the corresponding weights are (1, 1, 1, 1, 1, 1, 1), which is a 7-dimensional vector.
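
The conversion from a word segmentation result to a word-frequency vector might look like the sketch below. It assumes segmentation has already been performed (the segmenter itself is not shown), and the function name is illustrative.

```python
from collections import Counter


def term_frequency_vector(tokens):
    """Return (vocabulary, weights), with each weight the token's occurrence count."""
    counts = Counter(tokens)
    vocabulary = list(counts)  # distinct tokens in order of first appearance
    return vocabulary, [counts[word] for word in vocabulary]


# English rendering of the segmentation result in the example above.
tokens = ["based on", "inverted", "index", "simhash",
          "microblog", "deduplication", "method"]
vocabulary, weights = term_frequency_vector(tokens)
```

Each of the seven words occurs once, so `weights` is a 7-dimensional vector of ones, matching the example.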

The deduplication calculation module 307 performs hash-value segmentation, segment lookup, and Hamming distance calculation. Specifically, the deduplication calculation module 307 splits the f-bit binary signature into segments according to the set parameter, where the number of segments is greater than the value of the set parameter, builds an inverted index from the segmentation result, and performs segment-by-segment retrieval and Hamming distance calculation to make the duplicate judgment. That is, if a duplicate is found, the judgment result "duplicate" is returned; if not, the signature set under the next segment is retrieved from the inverted index, and so on until the last segment.

For example, the deduplication calculation module 307 splits the signature to be checked into 8 segments and, according to the built inverted index, retrieves the signature set under the first segment, then calculates one by one the Hamming distance to each signature in the set to judge duplication, until all index entries equal to that segment have been traversed. The deduplication calculation module 307 then judges whether the calculated Hamming distance is within the set parameter range. If it is, the text is judged to be a duplicate and the judgment result "duplicate" is returned; if it is not, the text is judged not to be a duplicate of those candidates, and the signature set under the second segment is retrieved according to the second segment, and so on until the 8th segment.

More specifically, the f-bit signature obtained from the simhash hash module is first split into segments, for example 8 segments, and each segment is then mapped to the signature, as shown in Fig. 4A, which is a schematic diagram of an inverted index according to an embodiment of the present invention. The 64-bit binary string "1011011010001111...0101011110011100" is split into eight segments "10110110", "10001111", ..., "01010111", "10011100". The 64-bit signature is then rearranged so that each of the eight segments in turn serves as the leading 8 bits; there are 8 such arrangements in total, producing 8 mapping tables. The leading 8 bits are then looked up by exact match. In this way, supposing the sample library holds 2^34 (about 17 billion) hash fingerprints, the signature set corresponding to each segment (that is, each table) returns 2^(34-16) = 262144 candidates, which greatly reduces the cost of the Hamming distance calculation.
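
The segmentation, the rearrangements, and the candidate-count arithmetic above can be illustrated with the following sketch. The helper names are hypothetical, signatures are held as bit strings for readability, and the candidate estimate assumes that fingerprints are spread uniformly over the key values.

```python
def split_bits(bits: str, m: int):
    """Split a bit string into m equal-width segments, left to right."""
    if len(bits) % m != 0:
        raise ValueError("signature length must be divisible by m")
    width = len(bits) // m
    return [bits[i * width:(i + 1) * width] for i in range(m)]


def leading_segment_arrangements(segments):
    """One arrangement per segment, with that segment moved to the front."""
    return [[segment] + segments[:i] + segments[i + 1:]
            for i, segment in enumerate(segments)]


def expected_candidates(log2_fingerprints: int, key_bits: int) -> int:
    """Average number of fingerprints sharing one exact key value."""
    return 2 ** (log2_fingerprints - key_bits)
```

Here `split_bits("1011011010001111", 2)` returns the first two segments quoted above, and `expected_candidates(34, 16)` reproduces the figure 262144. Note that the exponent 16 corresponds to a 16-bit lookup key; with the 8-bit keys of the eight-segment example the same estimate would give 2^26 candidates per table.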

The inverted index storage module 309 stores the signature segments; specifically, it stores the hash values and the inverted index.

For example, Fig. 4B is a schematic diagram of an example of an inverted index according to an embodiment of the present invention. Suppose the signatures are 16 bits long and a Hamming distance less than or equal to 3 (≤ 3) is the criterion for near-duplication. The system 300 captures the microblog content "Jingdong Double 11, I speak for myself, merchants give discounts of 300,000, offering top-quality food, the lowest prices, and the most powerful promotion". The model training module 303 and the simhash hash module 305 of the system 300 segment this post and apply simhash processing, obtaining the 16-bit signature f1: 1010111101010011. An inverted index is then built and stored, giving the structure shown in Fig. 4B. When the system 300 captures the microblog content "Jingdong is really fast: the computer ordered in the morning was delivered in the afternoon", word segmentation and simhash processing are performed as described above, obtaining the signature f2: 1101011111001001. For signature f2, the first segment 1101 is obtained first, and the first set of the inverted index storage structure is retrieved, obtaining signature f1. The Hamming distance between f1 and f2 is then calculated; when this Hamming distance is greater than 3, the second set of the inverted index storage structure is retrieved according to the second segment 0111, obtaining another candidate f1, whose Hamming distance is then calculated, and so on. If a duplicate is found, the result is returned directly; otherwise, this signature is also indexed and stored in the same way as signature f1.
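
A segment-keyed inverted index of the kind described in this example could be sketched as follows. The class and method names are hypothetical, signatures are held as bit strings, and lookups are assumed to be exact matches on the pair (segment position, segment value); this is an illustration under those assumptions rather than the stored structure of Fig. 4B.

```python
from collections import defaultdict


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))


class SegmentIndex:
    """Inverted index: (segment position, segment bits) -> set of signatures."""

    def __init__(self, num_segments: int, max_distance: int):
        self.num_segments = num_segments
        self.max_distance = max_distance
        self.tables = defaultdict(set)

    def _keys(self, signature: str):
        width = len(signature) // self.num_segments
        return [(i, signature[i * width:(i + 1) * width])
                for i in range(self.num_segments)]

    def find_duplicate(self, signature: str):
        """Return a stored signature within max_distance, or None."""
        for key in self._keys(signature):                 # segment by segment
            for candidate in self.tables.get(key, ()):    # exact-match lookup
                if hamming(signature, candidate) <= self.max_distance:
                    return candidate
        return None

    def check_and_store(self, signature: str) -> bool:
        """Return True for a duplicate; otherwise index the signature."""
        if self.find_duplicate(signature) is not None:
            return True
        for key in self._keys(signature):
            self.tables[key].add(signature)
        return False


index = SegmentIndex(num_segments=4, max_distance=3)
f1 = "1010111101010011"
f2 = "1101011111001001"
```

Under these assumptions, f1 and f2 differ in 8 bit positions and share no 4-bit segment, so after f1 is stored, `index.check_and_store(f2)` finds no candidate and stores f2 as a new signature. A signature that differs from f1 only in its last segment would be retrieved through any of the three matching segments and reported as a duplicate.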

Fig. 5 is a flowchart of a simhash microblog deduplication method 500 based on an inverted index according to an embodiment of the present invention. As shown in Fig. 5, the method 500 starts at step S501, in which, when the system captures microblog data, the model training module 303 segments the text into words according to dictionary data. Then, in step S503, the simhash module 305 performs word-frequency statistics on the text according to the word segmentation result and converts it into an N-dimensional vector. In step S505, the simhash module 305 performs the simhash operation, and in step S507 an f-bit binary signature is obtained. In step S509, the deduplication calculation module 307 splits the f-bit binary signature into segments according to the set parameter, where the number of segments is greater than the value of the set parameter, and builds an inverted index from the segmentation result, in which the key is each segment of the signature and the value is the signature itself. In step S511, the deduplication calculation module 307 retrieves, segment by segment according to the built inverted index, the signature set under a segment and calculates the corresponding Hamming distances, until all index entries equal to that segment have been traversed. In step S513, it is determined whether the calculated Hamming distance is within the set parameter range. If the calculated Hamming distance is within the set parameter range, the text is deemed to be a duplicate and no storage operation is needed, and the method returns to step S511 to retrieve the signature set under the next segment according to the built inverted index and calculate the corresponding Hamming distances. If the calculated Hamming distance is not within the set parameter range, the text is deemed not to be a duplicate, and in step S515 the segments are stored in the inverted index storage module.

For example, based on Hamming distance calculations over a large amount of Chinese text, when the simhash result is a 64-bit binary code, texts whose Hamming distance falls within the range 0-7 can preferably be considered duplicates.

In the technical scheme of the present application, performing simhash deduplication on the basis of the established inverted index storage is the key point of the invention. It overcomes the limits that traditional deduplication methods place on the timeliness and promptness with which an enterprise handles microblog information. More importantly, the Hamming distance parameter chosen for Chinese text in the duplicate judgment directly affects the accuracy of deduplication and the handling of important information. All of these play a key role in the promptness of an enterprise's information monitoring on microblogs.

The embodiments described above are merely preferred embodiments of the present invention and are not intended to limit the present invention. It will be apparent to those skilled in the art that various modifications and changes can be made to the embodiments of the present invention without departing from its spirit and scope. Accordingly, the present invention is intended to cover all modifications and variations that fall within the scope of the invention as defined by the claims.

Claims

1. A microblog deduplication method based on an inverted index, the method comprising:

segmenting, by a model training module, a text into words according to dictionary data;

performing, by a simhash module, word-frequency statistics on the text according to the word segmentation result to convert it into an N-dimensional vector, and performing a simhash operation on the N-dimensional vector to obtain an f-bit binary signature;

performing, by a deduplication calculation module, the following operations:

splitting the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result;

retrieving, segment by segment according to the built inverted index, the signature set under the first segment, and calculating the corresponding Hamming distance for the first segment; and

determining whether the Hamming distance calculated for the first segment is within the set parameter range.

2. The method according to claim 1, further comprising:

if the calculated Hamming distance is not within the set parameter range, deeming the text not to be a duplicate and storing the segments in an inverted index storage module.

3. The method according to claim 1 or 2, further comprising:

if the Hamming distance calculated for the first segment is within the set parameter range, retrieving, segment by segment according to the built inverted index, the signature set under the second segment and calculating the corresponding Hamming distance for the second segment; and

determining whether the Hamming distance calculated for the second segment is within the set parameter range.

4. The method according to claim 1, wherein the number of segments is greater than the value of the set parameter.

5. The method according to claim 1, wherein the set parameter range is 0-7.

6. A microblog deduplication system based on an inverted index, the system comprising:

a model training module configured to segment a text into words according to dictionary data;

a simhash module configured to perform word-frequency statistics on the text according to the word segmentation result to convert it into an N-dimensional vector, and to perform a simhash operation on the N-dimensional vector to obtain an f-bit binary signature;

a deduplication calculation module configured to perform the following operations:

splitting the f-bit binary signature into segments according to a set parameter, and building an inverted index from the segmentation result;

retrieving, segment by segment according to the built inverted index, the signature set under the first segment, and calculating the Hamming distance corresponding to the signature set of the first segment; and

determining whether the Hamming distance calculated for the first segment is within the set parameter range.

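7. The system according to claim 6, wherein the deduplication calculation module is further configured to:

if the calculated Hamming distance is not within the set parameter range, deem the text not to be a duplicate and store the segments in an inverted index storage module.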

8. The system according to claim 6, wherein the deduplication calculation module is further configured to:

if the Hamming distance calculated for the first segment is within the set parameter range, retrieve, segment by segment according to the built inverted index, the signature set under the second segment and calculate the Hamming distance corresponding to the signature set of the second segment; and

determine whether the Hamming distance calculated for the second segment is within the set parameter range.

9. The system according to claim 6, wherein the number of segments is greater than the value of the set parameter.

10. The system according to claim 6, wherein the set parameter range is 0-7.