CN102831198A

CN102831198A - Similar document identifying device and similar document identifying method based on document signature technology

Info

Publication number: CN102831198A
Application number: CN2012102784052A
Authority: CN
Inventors: 温赟; 杨青
Original assignee: PEOPLE SEARCH NETWORK AG
Current assignee: PEOPLE SEARCH NETWORK AG
Priority date: 2012-08-07
Filing date: 2012-08-07
Publication date: 2012-12-19

Abstract

The invention discloses a similar document identification device and method based on document signature technology. The device mainly comprises a content extraction module, a feature extraction module, a document signature computing module, a document signature index module and a similar document search module. The device and method address several problems of existing techniques: similar document identification techniques have high space complexity and cannot meet the requirements of streaming text processing, while space-efficient duplicate text identification techniques cannot identify similar texts. The method provides fast similarity identification for large volumes of streaming documents.

Description

A similar document identification device and method based on document signature technology

Technical field

The present invention relates to data mining and information retrieval technology, and in particular to a similar document identification device and method based on document signature technology.

Background technology

The term "document" in the present invention refers not only to traditional structured text documents but also to semi-structured HTML (Hypertext Markup Language) web pages and multimedia data such as pictures and videos. Because text documents have the widest range of application, this specification uses text documents as the example.

Similar document identification is significant for many applications. Taking information retrieval as an example, relevant statistics indicate that more than 40% of the web pages on the Internet are duplicates or near-duplicates. In vertical search products such as news, video and picture search, operations such as reprinting and sharing also produce a large amount of similar content. Identifying these similar web pages not only improves data processing efficiency but also reduces the repetition rate of search results, improving the user experience. In addition, similar document identification has important applications in fields such as plagiarism detection and machine translation.

Traditional duplicate text identification schemes use cryptographic hash techniques, such as computing the MD5 value of a document, and can only identify duplicate files whose content is exactly identical. However, small changes made when a document is reprinted may introduce some differences in content between similar documents, which causes cryptographic hash techniques to fail.
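The limitation described above can be illustrated with a minimal sketch. The two sample sentences and the helper name md5_fingerprint are assumptions made for illustration only; the text names MD5 merely as one example of a cryptographic hash.

```python
import hashlib


def md5_fingerprint(text: str) -> str:
    """Return the hexadecimal MD5 digest of a text (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Assumed sample data: a reprint that differs by one punctuation mark.
original = "Heavy rain is expected in the city tomorrow."
reprint = "Heavy rain is expected in the city tomorrow!"

# Identical content always yields the same digest.
exact_match = md5_fingerprint(original) == md5_fingerprint(original)

# A one-character change yields a different digest, so the
# near-duplicate is not detected by digest comparison.
near_match = md5_fingerprint(original) == md5_fingerprint(reprint)
```

Because the digests of the two texts are either equal or unrelated, digest comparison gives no notion of "how similar" two documents are.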

At present, text similarity identification mainly uses methods based on the vector space model (Vector Space Model, VSM). For example, the invention application with publication number CN 102314418, entitled "A context-sensitive Chinese similarity comparison method" (hereinafter Document 1), abstracts the target document into a vector in a text vector space: each keyword that occurs in the document is one dimension of the vector, and the number of times the keyword occurs in the document is usually used as the value of that dimension. The cosine similarity of two vectors can then be computed as the similarity measure of the two documents. Methods based on the vector space model solve the identification of similar content to some extent, but their space consumption is huge: they must store the content data of every document or, after compression, text vector information that is still proportional to the length of the document content. The invention application with publication number CN101576904, entitled "A system and method for computing text content similarity based on a weighted graph" (hereinafter Document 2), constructs a weighted graph from a document collection, computes the similarity between any two nodes of the graph on the basis of that weighted graph, and thereby obtains the similarity of documents. However, this method can only handle static document collections and is not suitable for streaming application scenarios such as information retrieval.
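The vector space model comparison described above can be sketched as follows. This is a generic illustration of term-count vectors and cosine similarity, not the specific method of Document 1; the sample sentences and whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Each distinct keyword is one dimension; its count is the value.
doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the cat sat on the rug".split())

similarity = cosine_similarity(doc_a, doc_b)  # 7 / 8 = 0.875
```

Note that each vector holds one entry per distinct keyword, so the stored data grows with document length, which is the space cost the paragraph above refers to.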

Summary of the invention

In view of this, the main purpose of the present invention is to provide a similar document identification device and method based on document signature technology, so as to solve the problems that existing similar text identification techniques have high space complexity and cannot meet the demands of streaming text processing, while space-efficient duplicate text identification techniques cannot identify similar texts; that is, to provide a fast similarity identification method for massive streaming documents.

To achieve the above purpose, the technical solution of the present invention is realized as follows:

A similar document identification device based on document signature technology mainly comprises a content extraction module, a feature extraction module, a document signature computing module, a document signature index module and a similar document search module, wherein:

The content extraction module is used to extract the title and the body text of the target document to obtain the body content;

The feature extraction module is used to convert said body content into a feature representation in the form of a set of corresponding <token, weight> pairs and to pass it to said document signature computing module;

The document signature computing module is used to convert each original token into a corresponding hash value and to update the document signature value using the weight of the current token, obtaining a final document signature value of fixed length;

The document signature index module is used to store the above document signature in the document signature index, or to store the entire signature set directly; and

The similar document search module is used to search the existing document signature index for a document signature whose distance from the signature of the target document is less than a certain threshold d, and to return the document signature of the similar document as the final ID of the target document.

Wherein said distance is the Hamming distance between binary codes, and said threshold d is 3.

A similar document identification method based on document signature technology comprises:

A. a step of extracting the title and the body text of the target document to obtain the body content;

B. a step of converting said body content into a feature representation in the form of a set of corresponding <token, weight> pairs and passing it to said document signature computing module;

C. a step of converting each original token into a corresponding hash value and updating the document signature value using the weight of the current token, obtaining a final document signature value of fixed length;

D. a step of storing the above document signature in the document signature index module, or storing the entire signature set directly;

E. a step of searching the existing document signature index for a document signature whose distance from the signature of the target document is less than a certain threshold d, and returning the document signature of the similar document as the final ID of the target document.

Wherein said step A is specifically:

A1. parsing the HTML source code of the web page to find the text blocks containing the title and body content information, removing irrelevant information in the process;

A2. removing irrelevant information from the text blocks obtained in step A1, and using a template matching method to remove noise information from the resulting text segments.

Said step B is specifically:

B1. first performing word segmentation on the document to obtain the term sequence of the segmentation result;

B2. forming one feature token from every k consecutive terms in the term sequence, the parameter k being 2;

B3. for each token constructed in step B2, computing the corresponding weight, taking the number of times tf that the token occurs in the document content as the weight.

The document signature computation in said step C proceeds as follows:

The <token, weight> set obtained after completing step B is passed to the document signature computing module as the feature representation of the source document. The module converts each original token in turn into a corresponding hash value and updates the document signature value using the weight of the current token. After all tokens in the feature representation have been processed, the final document signature value of fixed length is obtained.

Wherein a 64-bit bit string is used to represent the document signature, which can represent 2⁶⁴ states in total.

The similar document identification device and method based on document signature technology provided by the present invention have the following advantages:

The present invention uses document signature technology to represent a document as a document signature value of fixed length and converts the problem of computing document similarity into the problem of computing the distance between signature values, which solves the problem that conventional cryptographic hash techniques cannot identify similar documents. Compared with similar document identification methods based on the vector space model, fixed-length document signatures greatly reduce storage space and are better suited to processing massive data efficiently. The present invention also uses an incremental indexing technique to store the document signatures of the existing document collection and compares the signature of the target document against this index, making it suitable for streaming mining of dynamic text streams.

Description of drawings

Fig. 1 is a schematic diagram of the similar document identification device based on document signature technology according to the present invention;

Fig. 2 is a flow chart of the document signature computation algorithm in step 3 of the present invention.

Embodiment

The device and method of the present invention are described in further detail below with reference to the accompanying drawings and embodiments of the present invention.

To address the problems that existing duplicate text identification methods are space-efficient but cannot identify similar text content, and that similar text identification methods based on the vector space model have high space complexity, the present invention proposes a similar document identification method based on document signature technology, with the aim of providing a fast similarity identification method for massive streaming documents.

Fig. 1 is a schematic diagram of the similar document identification device based on document signature technology according to the present invention. As shown in Fig. 1, one embodiment of the device (a deduplication service system for similar news web pages) comprises five main functional modules: a content extraction module, a feature extraction module, a document signature computing module, a document signature index module and a similar document search module. The five functional modules respectively carry out five corresponding processing steps.

Step 1: For a target document such as a news web page, the content extraction module extracts the news (document) title and the body text. Specifically, this step is divided into two sub-steps:

Step 11: Parse the HTML source code of the web page to find the text blocks containing the news title and body content information, removing irrelevant information such as advertisement links and navigation bars in the process, which helps improve the accuracy of similarity identification.

Step 12: From the text blocks obtained in step 11, remove irrelevant information such as HTML tags; then use template matching on the resulting text segments to remove noise information such as the copyright statements and "share" link content common to news websites, further improving the precision of meaningful body content extraction.
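Steps 11 and 12 can be sketched with the Python standard library as follows. The text does not specify the parsing rules or the templates, so the set of skipped tags, the two regular-expression templates and the sample page are all assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed containers of irrelevant information (scripts, navigation bars, ...).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

# Hypothetical noise templates for copyright statements and "share" links.
NOISE_TEMPLATES = [
    re.compile(r"copyright\s.*all rights reserved", re.IGNORECASE),
    re.compile(r"share (this|to)\b", re.IGNORECASE),
]


class BodyExtractor(HTMLParser):
    """Collects the <title> text and the text of <p> blocks outside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped container
        self.current = None   # "title" or "p" while inside such a block
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in ("title", "p") and self.skip_depth == 0:
            self.current = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.skip_depth or self.current is None:
            return
        if self.current == "p":
            self.paragraphs[-1] += data   # inline tags such as <a> are dropped
        else:
            self.title += data


def extract_body(html: str):
    """Step 11: find title and text blocks. Step 12: drop template noise."""
    parser = BodyExtractor()
    parser.feed(html)
    parser.close()
    kept = []
    for paragraph in parser.paragraphs:
        text = paragraph.strip()
        if text and not any(t.search(text) for t in NOISE_TEMPLATES):
            kept.append(text)
    return parser.title.strip(), " ".join(kept)


# Assumed sample page.
page = """<html><head><title>Storm hits coast</title><script>var x = 1;</script></head>
<body><nav><p>Home | News</p></nav>
<p>A storm reached the coast on <a href="#">Monday</a>.</p>
<p>Copyright 2012 Example News. All rights reserved.</p>
<p>Share this article</p></body></html>"""

title, body = extract_body(page)
```

A production extractor would need site-specific rules; the point here is only the two-stage shape: structural filtering first, template matching second.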

For a target document D, the content extraction process first extracts its meaningful "body" content part C and excludes the meaningless noise information in the source document, thereby improving the accuracy of similar document identification.

Step 2: The news web page (document) body content C obtained after step 1 is converted by the feature extraction module into a feature representation in the form of a set of corresponding <token, weight> pairs. The feature extraction module extracts keyword tokens from the content and gives the weight of each keyword token; the corresponding set of <token, weight> pairs serves as the feature representation of the source document.

Specifically, this step is divided into three sub-steps:

Step 21: First perform word segmentation on the news text (document) to obtain the term sequence of the segmentation result.

Step 22: Every k consecutive terms in the term sequence form one feature token. In this embodiment the parameter k is 2. Compared with Document 1, where the parameter k is set to 1, the present invention additionally takes term position information into account and can better avoid misidentifying documents whose terms are identical but appear in a different order.

Step 23: For each token constructed in step 22, compute the corresponding weight. This embodiment of the present invention uses the number of times tf that the token occurs in the body text of the document as the weight; compared with the simple unit-weight strategy adopted directly in Document 2, this helps avoid misidentification related to term frequency.
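Steps 21 to 23 can be sketched as follows. Whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and the function name extract_features is hypothetical; k = 2 and the tf weight follow the embodiment.

```python
from collections import Counter


def extract_features(body: str, k: int = 2) -> Counter:
    """Return a mapping token -> tf, where a token is k consecutive terms."""
    # Step 21: word segmentation (assumed stand-in: lowercase + whitespace split).
    terms = body.lower().split()
    # Step 22: every k consecutive terms form one token.
    tokens = [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]
    # Step 23: the weight is the number of occurrences of the token.
    return Counter(tokens)


features = extract_features("the cat chased the cat")
# "the cat" occurs twice; "cat chased" and "chased the" occur once each.

# With k = 1 word order is lost; with k = 2 it is reflected in the tokens.
same_with_k1 = extract_features("dog bites man", 1) == extract_features("man bites dog", 1)
same_with_k2 = extract_features("dog bites man", 2) == extract_features("man bites dog", 2)
```

The last two comparisons illustrate the order-sensitivity argument of step 22: the two sentences share all terms but no 2-term token.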

Step 3: The <token, weight> set obtained after completing step 2 is passed to the document signature computing module as the feature representation of the source news web page (document). The module converts each original token in turn into a corresponding hash value and updates the document signature value using the weight of the current token. After all tokens in the feature representation have been processed, the final document signature value of fixed length is obtained. The algorithm flow is shown in Fig. 2.

In this embodiment of the present invention, a 64-bit bit string is used to represent the document signature, which can represent 2⁶⁴ different states in total. If necessary, the signature length can be adjusted to meet the needs of different application scenarios.
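The text does not spell out how the weight "updates" the signature, and Fig. 2 is not reproduced here. The sketch below assumes a SimHash-style weighted bit vote, which is one common way to realize such a description; the choice of BLAKE2b as the 64-bit token hash and the names token_hash and compute_signature are likewise assumptions.

```python
import hashlib

SIGNATURE_BITS = 64


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (assumed hash: 8-byte BLAKE2b digest)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def compute_signature(features: dict) -> int:
    """Fold <token, weight> pairs into one 64-bit signature (assumed SimHash-style)."""
    votes = [0] * SIGNATURE_BITS
    for token, weight in features.items():
        h = token_hash(token)
        for bit in range(SIGNATURE_BITS):
            # A set bit votes +weight for that position, a clear bit votes -weight.
            if (h >> bit) & 1:
                votes[bit] += weight
            else:
                votes[bit] -= weight
    signature = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            signature |= 1 << bit
    return signature


example_signature = compute_signature({"the cat": 2, "cat chased": 1, "chased the": 1})
```

Under this assumption, documents that share most of their heavily weighted tokens agree on most bit votes, so their signatures differ in only a few bit positions. The result does not depend on the order in which tokens are processed.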

Step 4: For historical news web pages, the document signatures computed in step 3 are all stored in the document signature index module; the simplest strategy is to store the entire signature set directly.

With that simple strategy, however, the similar document signature search in the following step 5 has linear complexity. This embodiment of the present invention therefore adopts a sharded index: the 64-bit signature bit string is divided into four 16-bit bit strings, and each 16-bit bit string is used as a key for storage in a corresponding index structure. In other words, the whole index consists of four sub-indexes of similar structure; each sub-index uses the corresponding 16 bits as the key and the list of all document signatures sharing that 16-bit substring as the value. At the cost of some storage redundancy, this index scheme greatly accelerates the similar document signature search in step 5, because the key lookup restricts the range of the linear search to 1/2¹⁶ of the original range.
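The sharded index described above can be sketched as follows. The class and method names are hypothetical, and plain in-memory dictionaries stand in for whatever internal index format an implementation would use.

```python
from collections import defaultdict

BLOCKS = 4
BLOCK_BITS = 16
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def split_blocks(signature: int) -> list:
    """Split a 64-bit signature into four 16-bit keys, lowest block first."""
    return [(signature >> (BLOCK_BITS * i)) & BLOCK_MASK for i in range(BLOCKS)]


class ShardedSignatureIndex:
    """Four sub-indexes; each maps a 16-bit key to the signatures sharing it."""

    def __init__(self):
        self.sub_indexes = [defaultdict(list) for _ in range(BLOCKS)]

    def add(self, signature: int) -> None:
        # The signature is stored once per sub-index (the storage redundancy).
        for sub_index, key in zip(self.sub_indexes, split_blocks(signature)):
            sub_index[key].append(signature)

    def candidates(self, signature: int) -> set:
        """Signatures sharing at least one 16-bit block with the query."""
        found = set()
        for sub_index, key in zip(self.sub_indexes, split_blocks(signature)):
            found.update(sub_index.get(key, ()))
        return found


index = ShardedSignatureIndex()
stored = 0x1111222233334444
index.add(stored)

near = stored ^ 0b111            # differs only inside the lowest block
far = 0xAAAABBBBCCCCDDDD         # shares no block with the stored signature
```

Only the candidate set returned by the key lookups needs to be compared bit by bit, instead of every stored signature.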

In addition, the document signature index module processes the signatures into a certain internal index format for storage; first, compression techniques can be used to reduce storage overhead, and second, this helps speed up the subsequent similar document search.

Compared with technical solutions based on VSM, this embodiment of the present invention, being based on fixed-length document signatures, greatly reduces storage space complexity.

Step 5: For the target news web page document signature S computed in step 3, search the existing document signature index for a document signature whose distance from S is less than a certain threshold d. If one exists, return the document signature of the similar document as the final ID of the target document; otherwise return the signature value computed in step 3 as the document ID.

In this embodiment of the present invention, the Hamming distance between binary codes is used as the distance metric, and the minimum similarity distance threshold parameter d is chosen as 3. This means that when the number of differing bits between two 64-bit bit strings is less than or equal to 3, the two corresponding news web pages are considered similar documents.
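The distance test and the ID assignment of step 5 can be sketched as follows. The function names are hypothetical, a plain list scan stands in for the index lookup, and the comparison uses "less than or equal to d" as stated for the embodiment.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def resolve_document_id(signature: int, known_signatures, d: int = 3) -> int:
    """Return the signature of a similar known document, else the signature itself."""
    for known in known_signatures:
        if hamming_distance(signature, known) <= d:
            return known      # similar document found: reuse its signature as ID
    return signature          # no similar document: the new signature is the ID


known = [0xFFFF0000FFFF0000]
within_threshold = known[0] ^ 0b111     # 3 differing bits
beyond_threshold = known[0] ^ 0b1111    # 4 differing bits

id_similar = resolve_document_id(within_threshold, known)
id_new = resolve_document_id(beyond_threshold, known)
```

With d = 3 and a signature split into four 16-bit blocks, at most three blocks can contain a differing bit, so two similar signatures always agree exactly on at least one block. A lookup keyed on the four blocks therefore cannot miss a signature within the threshold.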

On a news web page test data set, this embodiment of the present invention achieved an accuracy of 95%, far higher than the accuracy figures reported for the respective technical solutions in Document 1 and Document 2.

The above are merely preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention.

Claims

1. A similar document identification device based on document signature technology, characterized in that it mainly comprises a content extraction module, a feature extraction module, a document signature computing module, a document signature index module and a similar document search module, wherein:

2. The similar document identification device based on document signature technology according to claim 1, characterized in that said distance is the Hamming distance between binary codes, and said threshold d is 3.

3. A similar document identification method based on document signature technology, characterized in that it comprises:

4. The similar document identification method based on document signature technology according to claim 3, characterized in that said step A is specifically:

5. The similar document identification method based on document signature technology according to claim 3, characterized in that said step B is specifically:

6. The similar document identification method based on document signature technology according to claim 3, characterized in that the document signature computation in said step C proceeds as follows:

7. The similar document identification method based on document signature technology according to claim 6, characterized in that a 64-bit bit string is used to represent the document signature, which can represent 2⁶⁴ states in total.