CN102831198A - Similar document identifying device and similar document identifying method based on document signature technology - Google Patents



Publication number
CN102831198A
Authority
CN
China
Prior art keywords
document
signature
document signature
similar
token
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012102784052A
Other languages
Chinese (zh)
Inventor
温赟
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
PEOPLE SEARCH NETWORK AG
Original Assignee
PEOPLE SEARCH NETWORK AG
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by PEOPLE SEARCH NETWORK AG filed Critical PEOPLE SEARCH NETWORK AG
Priority to CN2012102784052A priority Critical patent/CN102831198A/en
Publication of CN102831198A publication Critical patent/CN102831198A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a similar document identifying device and a similar document identifying method based on a document signature technology. The similar document identifying device mainly comprises a content extracting module, a feature extracting module, a document signature computing module, a document signature indexing module and a similar document searching module. By the similar document identifying device and the similar document identifying method, problems that space complexity in an existing similar document identifying technology is high, the existing similar document identifying technology cannot meet application requirements on text streaming processing, a repeated text identifying technology with high space efficiency cannot identify similar texts, and the like are solved; and the similar document identifying method is a quick similarity identifying method for a large number of streaming documents.

Description

Similar document identification device and method based on document signature technology
Technical field
The present invention relates to data mining and information retrieval, and in particular to a similar document identification device and method based on document signature technology.
Background art
The term "document" in the present invention covers not only traditional structured text documents but also semi-structured HTML (Hypertext Markup Language) web pages and multimedia data such as pictures and video. Because text documents have the widest range of application, this specification uses text documents as the example.
Similar document identification matters in many applications. In information retrieval, for example, published statistics indicate that more than 40% of web pages on the Internet are duplicates or near-duplicates. Vertical search products for news, video, and pictures also accumulate large amounts of similar content through reposting and sharing. Identifying these similar pages improves data processing efficiency and, more importantly, reduces repetition in search results, which improves the user experience. Similar document identification also has important applications in fields such as plagiarism detection and machine translation.
Traditional duplicate text identification schemes apply a cryptographic hash to the document, for example by computing its MD5 value, and can therefore only identify documents whose content is exactly identical. A similar document, however, may pick up small changes when it is reposted, and any such difference in content defeats the cryptographic hash approach.
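The limitation described above is easy to reproduce. The following is a minimal sketch using Python's standard `hashlib`; the two sample sentences and the helper name `md5_hex` are illustrative assumptions, not part of the patent.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of the UTF-8 encoded text as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical example texts: the reprint differs only in its final character.
original = "Stocks rallied on Monday as investors welcomed the new policy."
reprint = "Stocks rallied on Monday as investors welcomed the new policy!"

# Identical content always yields the same digest ...
same = md5_hex(original) == md5_hex(original)
# ... but a one-character edit yields a different, unrelated digest,
# so the digest alone says nothing about how similar the texts are.
differs = md5_hex(original) != md5_hex(reprint)
```

An exact-match digest therefore works as a duplicate detector but not as a similarity measure.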
At present, text similarity identification mainly relies on methods based on the vector space model (VSM). One example is the invention application with publication number CN 102314418, entitled "A Chinese similarity comparison method based on context sensitivity" (hereinafter Document 1). It abstracts the target document into a vector in the text vector space: each keyword that occurs in the document is one dimension of the vector, and the number of times the keyword occurs in the document is usually used as the value of that dimension. The cosine similarity of two vectors can then be computed as the similarity measure of the two documents. Methods based on the vector space model solve the identification of similar content to some extent, but their space consumption is huge: they must store either the content of each document or text vector information that, even after compression, remains proportional to the length of the document content.
Another example is the invention application with publication number CN101576904, entitled "A system and method for computing text content similarity based on a weighted graph" (hereinafter Document 2). It constructs a weighted graph from a document collection, computes the similarity between any two nodes of the graph, and from that obtains the similarity of documents. This method, however, can only handle a static document collection and is not suited to stream-processing scenarios such as information retrieval.
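For reference, the vector space model measure discussed above can be sketched in a few lines. This is a generic illustration of term-frequency vectors and cosine similarity, not the specific method of Document 1; the function names are assumptions made for this sketch.

```python
import math
from collections import Counter


def term_vector(terms):
    """Map each keyword to the number of times it occurs in the document."""
    return Counter(terms)


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Note that each stored vector has one entry per distinct keyword, so its size grows with the document, which is the space cost the passage above refers to.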
Summary of the invention
In view of this, the main purpose of the present invention is to provide a similar document identification device and method based on document signature technology, so as to solve the problems that existing similar text identification techniques have high space complexity and cannot meet the demands of streaming text processing, while space-efficient duplicate text identification techniques cannot identify similar texts. It also provides a fast similarity identification method for massive streaming documents.
To achieve the above purpose, the technical solution of the present invention is realized as follows:
A similar document identification device based on document signature technology mainly comprises a content extraction module, a feature extraction module, a document signature computing module, a document signature index module, and a similar document lookup module, wherein:
the content extraction module is used for extracting the title and body text of a target document to obtain the body content;
the feature extraction module is used for converting the body content into a feature representation in the form of a set of <token, weight> pairs, and passing it to the document signature computing module;
the document signature computing module is used for converting each original token into a corresponding hash value and updating the document signature value according to the weight of the current token, to obtain a final document signature value of fixed length;
the document signature index module is used for storing the document signatures in a document signature index, or for directly storing the whole signature set; and
the similar document lookup module is used for searching the existing document signature index for a document signature whose distance from the target signature is less than a threshold d, and returning the document signature of the matching similar document as the final ID of the target document.
The distance is the Hamming distance between the binary codes, and the threshold d is 3.
A similar document identification method based on document signature technology comprises:
A. a step of extracting the title and body text of a target document to obtain the body content;
B. a step of converting the body content into a feature representation in the form of a set of <token, weight> pairs, and passing it to the document signature computing module;
C. a step of converting each original token into a corresponding hash value and updating the document signature value according to the weight of the current token, to obtain a final document signature value of fixed length;
D. a step of storing the document signature in the document signature index module, or directly storing the whole signature set;
E. a step of searching the existing document signature index for a document signature whose distance from the target signature is less than a threshold d, and returning the document signature of the matching similar document as the final ID of the target document.
Step A specifically comprises:
A1. parsing the HTML source code of the web page to find the text blocks containing the title and body content, removing irrelevant information in the process;
A2. after removing irrelevant information from the text blocks obtained in step A1, applying template matching to the resulting text passages to remove noise information.
Step B specifically comprises:
B1. first performing word segmentation on the document to obtain a sequence of terms;
B2. forming one feature token from every k consecutive terms in the term sequence, where the parameter k is 2;
B3. for each token constructed in step B2, computing the corresponding weight, taking the number of times tf that the token occurs in the document content as the weight.
The document signature computation in step C proceeds as follows:
The <token, weight> set obtained in step B is passed to the document signature computing module as the feature representation of the source document. The module converts each original token into its corresponding hash value in turn and updates the document signature value according to the weight of the current token. After all tokens in the feature representation have been processed, the final document signature value of fixed length is obtained.
The document signature is represented as a 64-bit bit string, which can represent 2^64 states in total.
The similar document identification device and method based on document signature technology provided by the present invention have the following advantages:
The present invention uses document signature technology to represent a document as a fixed-length signature value, converting the problem of computing document similarity into the problem of computing the distance between signature values. This solves the problem that conventional hashing techniques cannot identify similar documents. Compared with similar document identification methods based on the vector space model, a fixed-length document signature greatly reduces storage space and makes it easier to process massive data efficiently. The present invention also uses an incremental index to store the signatures of the existing document collection and compares the signature of a target document against this index, which makes it suitable for streaming mining of dynamic text streams.
Description of drawings
Fig. 1 is a schematic diagram of the similar document identification device based on document signature technology according to the present invention;
Fig. 2 is a flow chart of the document signature algorithm in step 3 of the present invention.
Embodiment
The device and method of the present invention are described in further detail below with reference to the accompanying drawings and embodiments.
Existing duplicate text identification methods are space-efficient but cannot identify similar text content, while similar text identification methods based on the vector space model have high space complexity. To address these problems, the present invention proposes a similar document identification method based on document signature technology, with the aim of providing fast similarity identification for massive streaming documents.
Fig. 1 is a schematic diagram of the similar document identification device based on document signature technology according to the present invention. As shown in Fig. 1, one embodiment of the device (a deduplication service system for similar news web pages) comprises five main functional modules: a content extraction module, a feature extraction module, a document signature computing module, a document signature index module, and a similar document lookup module. The five functional modules respectively perform five corresponding processing steps.
Step 1: for a target document such as a news web page, the content extraction module extracts the title and body text of the news item (document). This is divided into two sub-steps:
Step 11: parse the HTML source code of the web page to find the text blocks containing the headline and body content, removing irrelevant information such as advertisement links and navigation bars in the process, which helps improve the accuracy of similarity identification.
Step 12: in the text blocks obtained in step 11, remove irrelevant information such as HTML tags; then apply template matching to the resulting text passages to remove noise such as the copyright notices and "share" links commonly found on news sites, further improving the precision of body content extraction.
For a target document D, the content extraction process first extracts the meaningful "body" content C and excludes the meaningless noise in the source document, thereby improving the accuracy of similar document identification.
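The two sub-steps of step 1 might look roughly like the sketch below, which uses Python's standard `html.parser`. The patent does not specify how text blocks are located or which templates are matched, so the skipped tag set, the noise patterns, and the names `BodyTextExtractor` and `extract_body` are all assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Assumed set of elements treated as irrelevant blocks (step 11).
SKIPPED_TAGS = {"script", "style", "nav", "aside", "footer"}
# Hypothetical noise templates (step 12); a real site list would differ.
NOISE_PATTERNS = [
    re.compile(r"^copyright\b", re.IGNORECASE),
    re.compile(r"^share (this|to)\b", re.IGNORECASE),
]


class BodyTextExtractor(HTMLParser):
    """Collects text blocks outside of skipped elements; tags are dropped."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def extract_body(html: str) -> str:
    """Return title and body text with irrelevant blocks and noise removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    kept = [b for b in parser.blocks
            if not any(p.search(b) for p in NOISE_PATTERNS)]
    return "\n".join(kept)
```

A production extractor would need site-specific rules; the point here is only the order of operations: drop irrelevant blocks first, then filter the remaining text against noise templates.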
Step 2: the body content C of the news web page (document) obtained in step 1 is converted by the feature extraction module into a feature representation in the form of a set of <token, weight> pairs. The feature extraction module extracts keyword tokens from the content and assigns a weight to each; the resulting set of <token, weight> pairs serves as the feature representation of the source document.
This is divided into three sub-steps:
Step 21: first perform word segmentation on the news text (document) to obtain a sequence of terms.
Step 22: form one feature token from every k consecutive terms in the term sequence. In this embodiment k is 2. Compared with Document 1, where k is set to 1, the present invention takes term position into account, which better avoids falsely identifying documents that contain the same terms in a different order.
Step 23: for each token constructed in step 22, compute the corresponding weight. This embodiment uses the number of times tf that the token occurs in the document body as the weight. Compared with the simple unit-weight strategy adopted directly in Document 2, this helps avoid misidentification related to term frequency.
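Steps 21 to 23 can be sketched as follows. Whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated one), and the function name `extract_features` is an assumption made for this sketch.

```python
from collections import Counter


def extract_features(text: str, k: int = 2) -> dict:
    """Return a {token: weight} map for the given body text.

    Step 21: segment the text into terms (whitespace split as a stand-in).
    Step 22: join every k consecutive terms into one feature token.
    Step 23: use the token's occurrence count (tf) as its weight.
    """
    terms = text.split()
    tokens = [" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)]
    return dict(Counter(tokens))


features = extract_features("a b a b c")
# features == {"a b": 2, "b a": 1, "b c": 1}
```

With k = 1 the texts "a b" and "b a" produce identical features, whereas with k = 2 they do not, which is the word-order sensitivity that step 22 describes.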
Step 3: the <token, weight> set obtained in step 2 is passed to the document signature computing module as the feature representation of the source news web page (document). The module converts each original token into its corresponding hash value in turn and updates the document signature value according to the weight of the current token. After all tokens in the feature representation have been processed, the final fixed-length document signature value is obtained. The algorithm flow is shown in Fig. 2.
In this embodiment the document signature is represented as a 64-bit bit string, which can represent 2^64 different states in total. If necessary, the signature length can be adjusted to suit the needs of different application scenarios.
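The text does not spell out the update rule used in step 3 (Fig. 2 is not reproduced here). One well-known scheme that fits the description is SimHash-style weighted bit voting, sketched below under that assumption. The choice of truncated MD5 as the token hash and the names `token_hash` and `document_signature` are also assumptions, not details taken from the patent.

```python
import hashlib

SIGNATURE_BITS = 64


def token_hash(token: str) -> int:
    """Hash a token to a 64-bit integer (first 8 bytes of MD5, an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def document_signature(features: dict) -> int:
    """Fold a {token: weight} map into one 64-bit signature.

    Each token votes on every bit position: +weight where its hash has a 1,
    -weight where it has a 0. The final signature has a 1 wherever the
    accumulated vote is positive.
    """
    votes = [0] * SIGNATURE_BITS
    for token, weight in features.items():
        h = token_hash(token)
        for bit in range(SIGNATURE_BITS):
            if (h >> bit) & 1:
                votes[bit] += weight
            else:
                votes[bit] -= weight
    signature = 0
    for bit in range(SIGNATURE_BITS):
        if votes[bit] > 0:
            signature |= 1 << bit
    return signature
```

Under this scheme, documents that share most of their weighted tokens tend to agree on most bit votes, so their signatures tend to differ in only a few bits, which is what makes a Hamming-distance comparison meaningful.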
Step 4: for historical news web pages, the document signatures computed in step 3 are all stored in the document signature index module. The simplest strategy is to store the whole signature set directly.
With that simple strategy, however, the similar signature lookup in step 5 has linear complexity. This embodiment therefore adopts a sharded index: the 64-bit signature bit string is divided into four 16-bit substrings, and each 16-bit substring is stored as a key in its own index structure. In other words, the whole index consists of four sub-indexes of the same structure; each sub-index uses the corresponding 16-bit substring as the key and the list of all document signatures sharing that substring as the value. This index scheme trades some storage redundancy for a much faster similar signature lookup in step 5, because the key lookup restricts the linear scan to 1/2^16 of the original range.
In addition, having the document signature index module convert the signatures into an internal index format for storage has two benefits: compression can be used to reduce storage overhead, and subsequent similar document lookups become faster.
Compared with VSM-based technical solutions, this embodiment, being based on fixed-length document signatures, greatly reduces storage space complexity.
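The sharded index of step 4 can be sketched with four in-memory dictionaries. The class and method names are assumptions, and persistence and compression are omitted. The scheme does not miss matches at the threshold used in this embodiment: if two 64-bit signatures differ in at most 3 bits, those bits fall in at most 3 of the 4 blocks, so at least one 16-bit block is identical and the pair shares a key in at least one sub-index.

```python
from collections import defaultdict

BLOCK_BITS = 16
NUM_BLOCKS = 4
BLOCK_MASK = (1 << BLOCK_BITS) - 1


def blocks(signature: int) -> list:
    """Split a 64-bit signature into four 16-bit blocks (lowest block first)."""
    return [(signature >> (i * BLOCK_BITS)) & BLOCK_MASK
            for i in range(NUM_BLOCKS)]


class ShardedSignatureIndex:
    """Four sub-indexes, each keyed by one 16-bit block of the signature."""

    def __init__(self):
        self.sub_indexes = [defaultdict(list) for _ in range(NUM_BLOCKS)]

    def add(self, signature: int) -> None:
        # Each signature is stored four times: the storage redundancy
        # the description mentions.
        for i, key in enumerate(blocks(signature)):
            self.sub_indexes[i][key].append(signature)

    def candidates(self, signature: int) -> set:
        """Signatures sharing at least one identical block with the query."""
        found = set()
        for i, key in enumerate(blocks(signature)):
            found.update(self.sub_indexes[i].get(key, ()))
        return found

    def find_similar(self, signature: int, d: int = 3) -> list:
        """Candidates whose Hamming distance to the query is at most d."""
        return [c for c in self.candidates(signature)
                if bin(c ^ signature).count("1") <= d]
```

Only the candidate lists are scanned, so the full Hamming comparison runs on a small fraction of the stored signatures rather than on all of them.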
Step 5: for the document signature S of the target news web page computed in step 3, search the existing document signature index for a document signature whose distance from S is less than a threshold d. If one exists, return the document signature of that similar document as the final ID of the target document; otherwise return the signature value computed in step 3 as the document ID.
This embodiment uses the Hamming distance between binary codes as the distance metric and sets the similarity distance threshold d to 3. That is, when the number of differing bits between two 64-bit bit strings is less than or equal to 3, the two corresponding news web pages are considered similar documents.
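The lookup and ID assignment of step 5 can be sketched as below, using the simplest storage strategy (a plain list scanned linearly). The comparison follows the "less than or equal to 3" wording of this embodiment. Appending an unmatched signature to the list is an assumption that reflects the incremental, streaming use described earlier, and the function names are illustrative.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def assign_document_id(signature: int, known: list, d: int = 3) -> int:
    """Return the ID for a new document given previously seen signatures.

    If a known signature lies within distance d, its value is reused as the
    ID, marking the new document as similar to that earlier one. Otherwise
    the new signature becomes the ID and is remembered for later lookups
    (assumed behavior for a streaming setting).
    """
    for existing in known:
        if hamming_distance(signature, existing) <= d:
            return existing
    known.append(signature)
    return signature
```

Documents that receive the same ID can then be grouped or collapsed by whatever downstream system consumes the IDs, for example to suppress repeated search results.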
On a news web page test data set, this embodiment achieved an accuracy of 95%, far higher than the accuracy figures reported for the technical solutions of Document 1 and Document 2.
The above is merely a preferred embodiment of the present invention and is not intended to limit its scope of protection.

Claims (7)

1. A similar document identification device based on document signature technology, characterized in that it mainly comprises a content extraction module, a feature extraction module, a document signature computing module, a document signature index module, and a similar document lookup module, wherein:
the content extraction module is used for extracting the title and body text of a target document to obtain the body content;
the feature extraction module is used for converting the body content into a feature representation in the form of a set of <token, weight> pairs, and passing it to the document signature computing module;
the document signature computing module is used for converting each original token into a corresponding hash value and updating the document signature value according to the weight of the current token, to obtain a final document signature value of fixed length;
the document signature index module is used for storing the document signatures in a document signature index, or for directly storing the whole signature set; and
the similar document lookup module is used for searching the existing document signature index for a document signature whose distance from the target signature is less than a threshold d, and returning the document signature of the matching similar document as the final ID of the target document.
2. The similar document identification device based on document signature technology according to claim 1, characterized in that the distance is the Hamming distance between the binary codes, and the threshold d is 3.
3. A similar document identification method based on document signature technology, characterized in that it comprises:
A. a step of extracting the title and body text of a target document to obtain the body content;
B. a step of converting the body content into a feature representation in the form of a set of <token, weight> pairs, and passing it to the document signature computing module;
C. a step of converting each original token into a corresponding hash value and updating the document signature value according to the weight of the current token, to obtain a final document signature value of fixed length;
D. a step of storing the document signature in the document signature index module, or directly storing the whole signature set;
E. a step of searching the existing document signature index for a document signature whose distance from the target signature is less than a threshold d, and returning the document signature of the matching similar document as the final ID of the target document.
4. The similar document identification method based on document signature technology according to claim 3, characterized in that step A specifically comprises:
A1. parsing the HTML source code of the web page to find the text blocks containing the title and body content, removing irrelevant information in the process;
A2. after removing irrelevant information from the text blocks obtained in step A1, applying template matching to the resulting text passages to remove noise information.
5. The similar document identification method based on document signature technology according to claim 3, characterized in that step B specifically comprises:
B1. first performing word segmentation on the document to obtain a sequence of terms;
B2. forming one feature token from every k consecutive terms in the term sequence, where the parameter k is 2;
B3. for each token constructed in step B2, computing the corresponding weight, taking the number of times tf that the token occurs in the document content as the weight.
6. The similar document identification method based on document signature technology according to claim 3, characterized in that the document signature computation in step C proceeds as follows:
the <token, weight> set obtained in step B is passed to the document signature computing module as the feature representation of the source document; the module converts each original token into its corresponding hash value in turn and updates the document signature value according to the weight of the current token; after all tokens in the feature representation have been processed, the final document signature value of fixed length is obtained.
7. The similar document identification method based on document signature technology according to claim 6, characterized in that the document signature is represented as a 64-bit bit string, which can represent 2^64 states in total.
CN2012102784052A 2012-08-07 2012-08-07 Similar document identifying device and similar document identifying method based on document signature technology Pending CN102831198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012102784052A CN102831198A (en) 2012-08-07 2012-08-07 Similar document identifying device and similar document identifying method based on document signature technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012102784052A CN102831198A (en) 2012-08-07 2012-08-07 Similar document identifying device and similar document identifying method based on document signature technology

Publications (1)

Publication Number Publication Date
CN102831198A true CN102831198A (en) 2012-12-19

Family

ID=47334335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012102784052A Pending CN102831198A (en) 2012-08-07 2012-08-07 Similar document identifying device and similar document identifying method based on document signature technology

Country Status (1)

Country Link
CN (1) CN102831198A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090030891A1 (en) * 2007-07-26 2009-01-29 Siemens Aktiengesellschaft Method and apparatus for extraction of textual content from hypertext web documents
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN102402537A (en) * 2010-09-15 2012-04-04 盛乐信息技术(上海)有限公司 Chinese web page text deduplication system and method
CN102024065A (en) * 2011-01-18 2011-04-20 中南大学 SIMD optimization-based webpage duplication elimination and concurrency method

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103646029B (en) * 2013-11-04 2017-03-15 北京中搜网络技术股份有限公司 A kind of similarity calculating method for blog article
CN104715194A (en) * 2013-12-13 2015-06-17 北京启明星辰信息安全技术有限公司 Malicious software detection method and device
CN104715194B (en) * 2013-12-13 2018-03-27 北京启明星辰信息安全技术有限公司 Malware detection method and apparatus
CN104079560A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN104063318A (en) * 2014-06-24 2014-09-24 湘潭大学 Rapid Android application similarity detection method
CN104615681A (en) * 2015-01-21 2015-05-13 广州神马移动信息科技有限公司 Text selecting method and device
CN104967693A (en) * 2015-07-15 2015-10-07 中南民族大学 Document similarity calculation method facing cloud storage based on fully homomorphic password technology
CN104967693B (en) * 2015-07-15 2018-05-18 中南民族大学 Towards the Documents Similarity computational methods based on full homomorphism cryptographic technique of cloud storage
CN105589847A (en) * 2015-12-22 2016-05-18 北京奇虎科技有限公司 Weighted article identification method and device
CN105589847B (en) * 2015-12-22 2019-02-15 北京奇虎科技有限公司 The article identification method and device of Weight
CN106326388A (en) * 2016-08-17 2017-01-11 乐视控股(北京)有限公司 Method and device for processing information
CN106326197A (en) * 2016-08-23 2017-01-11 达而观信息科技(上海)有限公司 Method for fast detecting repeated copying texts
CN108304379A (en) * 2018-01-15 2018-07-20 腾讯科技(深圳)有限公司 A kind of article recognition methods, device and storage medium
CN108304379B (en) * 2018-01-15 2020-12-01 腾讯科技(深圳)有限公司 Article identification method and device and storage medium
CN108427714A (en) * 2018-02-02 2018-08-21 北京邮电大学 The source of houses based on machine learning repeats record recognition methods and system
CN110309446A (en) * 2019-04-26 2019-10-08 深圳市赛为智能股份有限公司 The quick De-weight method of content of text, device, computer equipment and storage medium
CN110377558A (en) * 2019-06-14 2019-10-25 平安科技(深圳)有限公司 Document searching method, device, computer equipment and storage medium
CN110377558B (en) * 2019-06-14 2023-06-20 平安科技(深圳)有限公司 Document query method, device, computer equipment and storage medium
CN111737966A (en) * 2020-06-11 2020-10-02 北京百度网讯科技有限公司 Document repetition degree detection method, device, equipment and readable storage medium
CN111737966B (en) * 2020-06-11 2024-03-01 北京百度网讯科技有限公司 Document repetition detection method, device, equipment and readable storage medium
CN115329050A (en) * 2022-10-12 2022-11-11 北京金堤科技有限公司 Information tracing method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN102831198A (en) Similar document identifying device and similar document identifying method based on document signature technology
US11321421B2 (en) Method, apparatus and device for generating entity relationship data, and storage medium
CN103049568B (en) The method of the document classification to magnanimity document library
Choudhury et al. Figure metadata extraction from digital documents
CN101807208B (en) Method for quickly retrieving video fingerprints
CN102402537A (en) Chinese web page text deduplication system and method
CN110750615B (en) Text repeatability judgment method and device, electronic equipment and storage medium
CN101079025A (en) File correlation computing system and method
CN109271487A (en) A kind of Similar Text analysis method
CN103646029A (en) Similarity calculation method for blog articles
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN112527948A (en) Data real-time duplicate removal method and system based on sentence-level index
CN110674635B (en) Method and device for dividing text paragraphs
CN113627132B (en) Data deduplication marking code generation method, system, electronic equipment and storage medium
CN109472020B (en) Feature alignment Chinese word segmentation method
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
JP2010182238A (en) Citation detection device, device and method for creating original document database, program and recording medium
CN103118028B (en) Based on the security sweep method and system of web analysis
WO2013063734A1 (en) Determining document structure similarity using discrete wavelet transformation
US20090182759A1 (en) Extracting entities from a web page
CN112364647A (en) Duplicate checking method based on cosine similarity algorithm
Wang et al. A novel web page text information extraction method
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
Li et al. Real-time video copy detection based on Hadoop
CN109992716B (en) Indonesia similar news recommendation method based on ITQ algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121219