CN107679073A - Method for constructing a compressed-webpage fingerprint library and method for fast similarity matching of compressed webpages - Google Patents

Publication number
CN107679073A
Authority
CN
China
Prior art keywords
webpage
compressed
participle
finger print
fingerprint base
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710742190.8A
Other languages
Chinese (zh)
Inventor
杨嵘
张斌
张鹏
杨威
李舒
窦凤虎
刘庆云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Collating Specific Patterns (AREA)

Abstract

The present invention relates to a method for constructing a fingerprint library of compressed webpages and a method for fast similarity matching of compressed webpages. The method comprises: performing Huffman decoding on Gzip-compressed webpages to obtain half-decompressed webpages; performing word segmentation on the half-decompressed webpages and selecting the words that can characterize each webpage as feature words; performing dimensionality reduction on the feature words of each webpage to generate a one-dimensional fingerprint; building a fingerprint library from the webpage fingerprints; and, for an online Gzip-compressed webpage, generating a one-dimensional fingerprint by the same procedure, comparing it for similarity against the fingerprints in the fingerprint library of Gzip-compressed webpages, and deciding whether the pages are similar according to a preset similarity threshold. Matching compressed webpages with this method effectively improves the efficiency of similarity matching for compressed webpages.

Description

Method for constructing a compressed-webpage fingerprint library and method for fast similarity matching of compressed webpages
Technical field
The invention belongs to the field of network security, and specifically relates to a fingerprint library construction method and a fast similarity matching method for compressed webpages.
Background art
At present there are large numbers of harmful websites both in China and abroad, and they degrade the network environment. Limiting the harm that such sites do to society and to the public has long been a concern of governments around the world.
According to data from DoubleClick, an advertising service provider owned by Google, dozens of the top 500 websites by traffic are harmful adult websites. Data published by the U.S. outlet Business Insider indicate that harmful websites account for 12% of all websites worldwide. On average, someone who browses 10 websites a day is likely to land on one harmful site. According to a report in Business Value, the world's largest harmful adult website receives up to 4.4 billion visits per month, twice the visits of JD.com and 11 times those of Youku. At the time of writing it ranked 34th on the Alexa site ranking, far ahead of well-known sites such as the BBC and Tmall. In repeated "Clean Network" campaigns, operators and security software vendors have used various methods to detect and shut down harmful sites involving pornography, gambling, or trojan hosting, in order to cut off the sources of illegal content. However, illegal sites evade detection by changing their servers, changing their domain names, and using mirror proxies, among other techniques, so they cannot be found by active probing or active crawling.
Detecting harmful websites online through passive traffic analysis is therefore of far-reaching importance for maintaining the network environment. Passive traffic analysis usually relies either on content inspection by deep packet inspection, or on extracting fingerprints from page content with hash algorithms such as MD5 or SHA-1 and matching newly captured pages against those fingerprints. In real web traffic, however, many pages contain variant or substitute words. For example, a server's self-censorship may replace sensitive words, and both methods then fail. Differences between servers' keyword filter lists also produce different pages, whose hash values differ, so matching fails again. Pages accessed through a mirror proxy site, or the same server page fetched from different devices (such as a PC and a mobile phone) or at different times, also differ in layout or content. Even a minor change in a locally displayed item changes the fingerprint of the whole page completely, so MD5 or SHA-1 hashing again cannot be used for matching.
Therefore, because keyword inspection and exact fingerprint matching both fail in passive traffic analysis, similarity matching is needed for webpages that contain substitute words.
Most website service providers and applications use Gzip to compress the content they deliver, and Gzip-compressed transfer has become a basic part of HTTP/1.1. Both clients and servers support the transfer of Gzip-compressed webpages. Browsers such as IE, Edge, Firefox, Chrome, Safari, Sogou Browser, and 360 Browser all accept webpages transferred in Gzip format. On the server side, Nginx, Microsoft IIS, Apache, Tomcat, and others have built-in Gzip compression that can be enabled with simple configuration. For example, Nginx only needs the directive gzip on written in conf/nginx.conf, and the file types to compress, such as CSS, JavaScript, and HTML files, can be selected. In terms of online traffic share, 50% of the top 3000 Alexa-ranked websites have Gzip compression enabled. The higher a site ranks, the more likely it is to enable Gzip; among the top 100 sites, 65% do. Gzip-compressed content likewise makes up a large share of real network traffic. According to data measured and collected by the National Internet Emergency Center, 65% of the text-type traffic actually transmitted on the network is Gzip-compressed when measured by volume, and 66% when measured by count. Both shares exceed 60%.
Because Gzip-compressed webpage traffic makes up the majority of network traffic, passive traffic analysis must take into account the effect of Gzip compression on similar-webpage matching. But Gzip-compressed webpages are encoded in an opaque form and are hard to parse: a Gzip stream consists of a Gzip header followed by content produced by the Deflate algorithm, and Deflate compresses the original text in two steps, LZ77 encoding followed by Huffman encoding (static or dynamic). Gzip decompression is also time-consuming and consumes substantial computing resources.
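The layering described above can be checked with a short sketch. It uses only the Python standard library, which is an assumption of this illustration and not something the patent specifies: the header and trailer are peeled off by hand and the remaining payload is decoded as a raw Deflate stream. The sample page content is made up.

```python
import gzip
import struct
import zlib

# Made-up page content, used only for illustration.
html = b"<html><body>" + b"similar page content " * 20 + b"</body></html>"
blob = gzip.compress(html)

# Fixed 10-byte Gzip header: magic (2), method (1), flags (1), mtime (4), xfl (1), os (1).
magic, method, flags = blob[0:2], blob[2], blob[3]
assert magic == b"\x1f\x8b" and method == 8   # method 8 means Deflate
assert flags == 0                              # no optional header fields in this stream

payload = blob[10:-8]                          # raw Deflate stream between header and trailer
crc32, isize = struct.unpack("<II", blob[-8:]) # trailer: CRC-32 and original length

# First Deflate block header: bit 0 is BFINAL, bits 1-2 are BTYPE
# (0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman).
btype = (payload[0] >> 1) & 0b11

# wbits=-15 asks zlib for a raw Deflate stream with no wrapper.
restored = zlib.decompress(payload, wbits=-15)
print(btype, restored == html, crc32 == zlib.crc32(html), isize == len(html))
```

Streams captured from real traffic may set header flags (for example a file name), in which case the header is longer than 10 bytes and the fixed offset used here no longer applies.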
Gzip-compressed webpages are hard to parse and slow to process, which conflicts with the speed required for online inspection. A faster matching algorithm tailored to the characteristics of Gzip is therefore needed.
Summary of the invention
To solve the problem that similarity matching of Gzip-compressed webpages is slow, the present invention proposes a fast similarity matching method for compressed webpages that targets the Gzip compression algorithm and is suitable for intrusion detection systems in high-speed networks.
The technical solution adopted by the present invention is as follows:
A method for constructing a compressed-webpage fingerprint library, comprising the following steps:
1) performing Huffman decoding on Gzip-compressed webpages to obtain half-decompressed webpages;
2) performing word segmentation on the half-decompressed webpages, and selecting the words that can characterize each webpage as feature words;
3) performing dimensionality reduction on the feature words of each webpage to generate a one-dimensional fingerprint;
4) building the fingerprint library from the webpage fingerprints.
Further, after the half-decompressed webpage is obtained in step 1), the pointers in it, and the incomplete words left behind by those pointers, are removed.
Further, step 2) performs word segmentation with IKAnalyzer, computes the weight of every word with the TF-IDF algorithm, sorts the words by weight, and selects a certain number of them as the feature words that characterize the webpage.
Further, step 3) applies the Simhash algorithm to the feature words for dimensionality reduction, generating a one-dimensional fingerprint.
Further, step 4) builds the fingerprint library by combining the pigeonhole principle with a trie (dictionary tree): the hash-table values, conventionally stored as linked lists, are replaced by tries to form a high-speed index.
A method for similarity matching of compressed webpages, comprising the following steps:
1) performing Huffman decoding on an online Gzip-compressed webpage to obtain a half-decompressed webpage;
2) performing word segmentation on the half-decompressed webpage, and selecting the words that can characterize the webpage as feature words;
3) performing dimensionality reduction on the feature words of the webpage to generate a one-dimensional fingerprint;
4) comparing the generated fingerprint of the online webpage for similarity against the fingerprints in the fingerprint library of Gzip-compressed webpages, and deciding whether they are similar according to a preset similarity threshold.
Further, in step 1) the fingerprint library is built from the Gzip-compressed webpages of harmful websites and serves as a malicious sample library; step 4) compares the fingerprint of the online webpage against the fingerprints in the malicious sample library to judge whether the online webpage is malicious.
Further, in step 1) the fingerprint library is built by combining the pigeonhole principle with a trie, replacing the linked-list hash-table values with tries to form a high-speed index; when step 4) performs the similarity comparison, the fingerprint of the online webpage is divided into n blocks according to the pigeonhole principle, and each of the n blocks is looked up in the hash table to find the corresponding hash value, which is a trie.
Further, step 4) determines similarity by computing the Hamming distance between fingerprints; for the Hamming distance computation within a trie, a min-heap is maintained to prune the search and speed up matching.
A server comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, and the computer program comprises instructions for performing each step of the methods described above.
A computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the methods described above.
The beneficial effects of the present invention are as follows:
The present invention extracts features from the half-decompressed data rather than after full decompression, and combines a trie with the pigeonhole principle to build the high-speed index needed for similarity retrieval. Compared with existing methods based on full decompression, similarity matching of compressed webpages with the present method is about 40% faster, which effectively improves matching efficiency.
Brief description of the drawings
Fig. 1 is the overall framework of the fast similarity matching method for compressed webpages.
Fig. 2 is a schematic diagram of the high-speed index.
Fig. 3 is a schematic diagram of the trie.
Detailed description of the embodiments
The present invention is described in further detail below through specific embodiments and the accompanying drawings.
The core purpose of fast similarity matching of compressed webpages is to speed up the matching of malicious webpages in online network traffic and to reduce the time taken by the whole process. To this end, the present invention optimizes the traditional similarity matching framework specifically for compressed webpages. The main design points are the following:
1) Extraction of feature words in the half-decompressed state: a Gzip stream consists of Deflate output wrapped in a Gzip header and trailer; the Gzip payload is produced by the Deflate algorithm, which applies LZ77 encoding and then Huffman encoding. Words are extracted from the half-decompressed data, that is, from the LZ77-encoded form. LZ77 compresses by replacing repeated content with a pointer, a (distance, length) pair that refers back to the earlier occurrence. Segmentation on the LZ77-encoded form therefore ignores the pointers first. This does not harm the segmentation result, because everything removed refers to repeated content. After this processing the input to the segmentation algorithm is substantially smaller, which reduces the overall time cost.
2) High-speed index: the generated webpage fingerprint must be compared pairwise with the fingerprints in the malicious sample library, computing the Hamming distance, before similarity can be determined. For an online system, comparing against every fingerprint one by one is too slow, so a high-speed index structure is needed. Here the pigeonhole principle is combined with a trie to speed up matching in the fingerprint library without significantly increasing memory use, so that similarity results are returned more quickly.
The overall framework of the present invention is shown in Fig. 1. The dotted line divides it into two modules: on the left, the module that builds the malicious sample library; on the right, the module that processes online webpages. D denotes the Gzip-compressed webpages used to build the malicious sample library, R1 denotes an online Gzip-compressed webpage, C1~C4 denote the half-decompressed webpages obtained after Huffman decoding, F1~F4 denote the extracted feature strings, I1~I3 and T1 denote the fingerprints obtained with the Simhash algorithm, S denotes the resulting index, and A denotes the similar documents found by the similarity comparison.
First, the main steps of the module that builds the malicious sample library are:
1) Processing of the half-decompressed webpage: the file is first Huffman-decoded; then, on the half-decompressed data, that is, the LZ77-encoded form, the pointers ((distance, length) pairs) are skipped, and incomplete words are then handled in turn.
2) Extraction of feature words: a fast Chinese word segmentation algorithm is chosen to segment the half-decompressed file. This embodiment uses the IKAnalyzer library for word segmentation. The weights of the words are then computed with the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, and the top 50 feature words are selected.
3) Feature dimensionality reduction: Simhash is computed over the extracted feature words, producing a one-dimensional fingerprint.
4) Construction of the fingerprint library: the fingerprints of a massive number of compressed webpages are obtained as in step 3) and form the fingerprint library. Here the traditional pigeonhole-principle scheme is improved by replacing the linked-list hash-table values with tries, which speeds up matching among massive numbers of fingerprints while keeping space usage balanced.
Second, the main steps of the module that processes online webpages are:
1) Pre-processing and post-processing: early-stage screening selects suspicious webpages, which then go on to half decompression. At the end, if a highly similar webpage is found in the malicious sample library, the webpage is considered malicious and follow-up actions can be taken; otherwise it is let through. This step can be implemented with existing techniques and is not the main subject of the present invention, so it is not described in detail.
2) Processing of the half-decompressed webpage: same as step 1) of the sample library processing above;
3) Extraction of feature words: same as step 2) of the sample library processing above;
4) Feature dimensionality reduction: same as step 3) of the sample library processing above;
5) Retrieval in the fingerprint library: similarity is computed as the Hamming distance between the fingerprint of the online webpage and the fingerprints in the library. The fingerprint is divided into n blocks according to the pigeonhole principle, and each block is looked up in the hash table to find the corresponding hash value, which is a trie. Computing Hamming distances over the many fingerprints in a trie requires maintaining a min-heap for pruning, which speeds up matching.
Each of the above steps is described in detail below.
1. Processing of the half-decompressed webpage:
Both online compressed webpages and the compressed webpages in the malicious sample library start with this step, and its cost and output strongly affect the subsequent steps. The algorithm is as follows:
1) Reuse the Huffman decompression stage, which has two types: dynamic Huffman and static Huffman decompression.
2) Obtain the half-decompressed, LZ77-encoded content and remove the pointers; incomplete words created by the pointers are removed as well so that they do not affect segmentation. Because of how LZ77 works, almost all content repeated within the 32 KB window is removed, and the resulting file contains only literal (plaintext) content.
Comparative tests show that this step reduces processing time by 20~60%; the variation is large and depends strongly on the original text. This reduction shortens the overall processing time. In addition, the half-decompressed file is 15~60% smaller than the original file; this variation is likewise large and depends strongly on the original text. The smaller file is the input to subsequent processing, which lowers later costs and speeds up processing.
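The effect of skipping pointers can be illustrated with a toy model. The sketch below does not parse real Deflate data: it assumes the Huffman stage has already produced a stream of LZ77 tokens, and it generates such a stream with a naive tokenizer. The function names, the MARK placeholder, and the rule for dropping incomplete words (discard any word that touches a skipped pointer) are all hypothetical choices made for this illustration.

```python
MARK = "\x00"  # hypothetical placeholder left where a pointer was skipped

def lz77_tokens(text, window=32 * 1024, min_len=3, max_len=258):
    """Naive greedy LZ77 tokenizer: literal characters or (distance, length) pairs."""
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def full_decode(tokens):
    """Full decompression: resolve every pointer by copying earlier output."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)

def half_decode(tokens):
    """Half decompression: keep literals, skip pointers, drop words cut by a pointer."""
    text = "".join(MARK if isinstance(t, tuple) else t for t in tokens)
    return " ".join(w for w in text.split() if MARK not in w)

text = "harmful site mirror one. harmful site mirror two."
tokens = lz77_tokens(text)
half = half_decode(tokens)
print(repr(half), len(half), len(text))
```

In this toy run the repeated phrase collapses into a single pointer, so the segmenter would see roughly half of the original text. The simple dropping rule also discards the complete word that follows the pointer, which shows why the handling of words next to a pointer needs care in a real implementation.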
2. Extraction of feature words:
This module extracts the word vector that best characterizes the webpage:
1) To speed up online processing and meet the real-time demands of heavy traffic, the fastest word segmentation system must be chosen. After comparing several Chinese word segmenters, this embodiment selected IKAnalyzer;
2) To select the words that characterize the webpage, the weights of all words are computed, the words are sorted, and a small number are kept. This embodiment uses the conventional TF-IDF algorithm to compute the weights and finally selects the top 50 words as the word vector that characterizes the webpage;
At the end of this step a vector of 50 words is obtained for each webpage. This vector serves as the webpage's feature and as the input to subsequent processing.
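A minimal sketch of the weighting and top-k selection follows. It assumes the pages have already been segmented into word lists (IKAnalyzer is a Java library and is not called here), and it uses one common smoothed TF-IDF variant; the patent does not state which formula is used. The function name top_k_tfidf and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def top_k_tfidf(doc_tokens, corpus, k=50):
    """Return the k highest-weighted (word, weight) pairs for one segmented page."""
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))            # document frequency: one count per page
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    weights = {
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        w: (c / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
        for w, c in tf.items()
    }
    # Sort by descending weight, then alphabetically so ties are deterministic.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

corpus = [
    ["casino", "bonus", "login", "casino"],
    ["news", "login", "weather"],
    ["casino", "news", "sports"],
]
features = top_k_tfidf(corpus[0], corpus, k=2)
print(features)
```

With k=50, as in the embodiment, the returned pairs would form the input to the dimensionality reduction step.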
3. Dimensionality reduction of the feature vector:
Once the word vector is obtained, the similarity between webpages can be computed, but the traditional approach, cosine similarity, consumes substantial computing resources. Dimensionality reduction is therefore applied here to speed up the similarity computation. The steps are as follows:
1) Hash: compute a hash value for each word extracted in the previous step with a conventional algorithm such as MD5 or SHA-1;
2) Weight: taking the hash value in binary, weight each bit by the word's TF-IDF weight (in the standard Simhash formulation a 1 bit contributes +w and a 0 bit contributes -w);
3) Merge: sum the weighted results of all words position by position to obtain one merged weight per bit position;
4) Reduce: map each value greater than 0 to 1 and each value less than or equal to 0 to 0. The resulting binary bit string is the fingerprint.
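The four steps above can be sketched as follows. The sketch assumes the standard Simhash sign convention (+w for a 1 bit, -w for a 0 bit), a 64-bit fingerprint, and MD5 truncated to its first 8 bytes; the fingerprint length and the truncation are illustrative assumptions, and the sample weights are made up.

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Collapse a {word: weight} mapping into a single bits-wide integer fingerprint."""
    totals = [0.0] * bits
    for word, weight in weighted_words.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()     # step 1: hash
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # steps 2 and 3: signed weight per bit, summed over all words
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:                                           # step 4: sign to bit
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

page_a = {"casino": 0.64, "bonus": 0.42, "login": 0.32}
page_b = {"casino": 0.64, "bonus": 0.42, "signin": 0.32}   # one word substituted
print(hamming(simhash(page_a), simhash(page_b)))
```

Because each bit is decided by a weighted vote, replacing a low-weight word changes only the bit positions where the vote was close, which is the property that exact MD5 or SHA-1 page hashes lack.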
4. Construction of the high-speed index:
Computing similarity requires pairwise comparison, and with a massive number of fingerprints, finding similar webpages takes a great deal of computation. A faster method is therefore needed. Here the pigeonhole principle and a trie are combined to build a faster index. The steps are as follows:
1) Assume the threshold is k and divide the fingerprint evenly into k+1 data blocks. By the pigeonhole principle, two fingerprints that differ in at most k bits must agree exactly on at least one of the k+1 blocks;
2) As shown in Fig. 2, the k+1 data blocks serve as keys of the hash table. If a block has no corresponding key in the table, the block is added as a new key; if it already exists, no new key is created.
3) The fingerprint is appended to the hash value of each of its block keys. The hash value here is a trie, built in the same way as a conventional trie; this can be done with existing techniques and is not described in detail.
Fig. 2 is a schematic diagram of the resulting high-speed index. Si denotes the fingerprint of a webpage used to build the index, and KEY1~KEY4 denote the k+1 data blocks into which the fingerprint is split. The dashed box labeled Hash key contains the keys of the hash table, which are fingerprint blocks; the dashed box labeled Hash value contains the webpage fingerprints stored as tries, and every fingerprint in a trie contains the data block of its key.
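The index construction can be sketched with nested dictionaries standing in for the hash table and the tries. One detail here is an assumption of the sketch and not stated in the patent: each key is a (block position, block bits) pair, so that equal bit patterns at different positions do not share a bucket. The names split_blocks, trie_insert, and build_index are hypothetical, and an 8-bit fingerprint is used only to keep the example readable.

```python
def split_blocks(fp, k, bits=64):
    """Split a fingerprint into k+1 (position, block_bits) keys of near-equal width."""
    s = format(fp, f"0{bits}b")
    n = k + 1
    base, extra = divmod(bits, n)
    keys, start = [], 0
    for i in range(n):
        width = base + (1 if i < extra else 0)
        keys.append((i, s[start:start + width]))
        start += width
    return keys

def trie_insert(root, bitstring, doc_id):
    """Insert a full fingerprint into a binary trie; documents hang off the last node."""
    node = root
    for b in bitstring:
        node = node.setdefault(b, {})
    node.setdefault("docs", []).append(doc_id)

def build_index(fingerprints, k, bits=64):
    """Hash table from block key to a trie holding every fingerprint with that block."""
    index = {}
    for doc_id, fp in fingerprints.items():
        s = format(fp, f"0{bits}b")
        for key in split_blocks(fp, k, bits):
            trie_insert(index.setdefault(key, {}), s, doc_id)
    return index

library = {"A": 0b10110010, "B": 0b10111111}
index = build_index(library, k=3, bits=8)
print(sorted(index))
```

Each fingerprint is stored k+1 times, once per block key, which is the memory cost paid for restricting every lookup to fingerprints that share at least one block with the query.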
5. Fingerprint matching:
Once the index is built, the extracted fingerprint of the online webpage is matched against the high-speed index, that is, compared for similarity. A pairwise comparison counts the bit positions in which two fingerprints differ, which is the Hamming distance. A threshold is specified: if the Hamming distance exceeds it the pages are dissimilar, otherwise they are similar. The threshold is an empirical value, usually 2 or 3, and depends on the fingerprint length. The matching steps are as follows:
1) With the agreed threshold k, divide the fingerprint into k+1 data blocks.
2) Look up each of the k+1 blocks in the hash table. If the key is found, return the corresponding trie; otherwise the match ends.
3) The trie is shown in Fig. 3. The binary string along the path from the root node to a leaf node is the Simhash value generated for one webpage. A-J denote the documents associated with the end of each path; because all fingerprints have the same length, these are leaf nodes. The six "X" marks in Fig. 3 denote paths on which matching failed.
4) Matching in the trie maintains a min-heap. Initially the two children of the root are put into the heap; the heap key is the Hamming distance accumulated at the child node.
5) Each time, the top element is popped and searched depth-first. At each node, the child with the smaller Hamming distance is followed, and the other child is pushed onto the heap if its distance does not exceed the threshold k.
6) If, as the traversal continues, the Hamming distance becomes greater than that of the heap's top element, the search stops, the current node is pushed back onto the heap, and the new top element is popped to continue the depth-first search.
7) This continues until a fingerprint within the threshold k is found or the whole trie has been traversed. A successful match means the malicious sample library contains a webpage highly similar to the online webpage; the online webpage is then considered malicious and follow-up processing can be carried out.
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the present invention; the scope of protection shall be as defined in the claims.

Claims (10)

1. a kind of compressed webpage fingerprint base construction method, it is characterised in that comprise the following steps:
1) Hofmann decoding is carried out to Gzip compressed webpages, obtains half decompression webpage;
2) half-and-half decompression webpage carries out word segmentation processing, and chooses the participle that can characterize webpage as feature participle;
3) dimension-reduction treatment is carried out to the feature participle of webpage, generates one-dimensional finger print information;
4) fingerprint base is built according to the finger print information of webpage.
2. the fingerprint base construction method of compressed webpage as claimed in claim 1, it is characterised in that step 1) obtains half decompression After webpage, to pointer therein and because the imperfect participle that pointer reason is formed is removed processing.
3. the fingerprint base construction method of compressed webpage as claimed in claim 1, it is characterised in that step 2) uses IKAnalyzer algorithms carry out word segmentation processing, and the weight of all participles is calculated using TF-IDF algorithms, to being selected after weight sequencing A certain amount of participle is taken as the feature participle for characterizing webpage;Step 3) is segmented to feature using Simhash algorithms and carries out dimensionality reduction Processing, generates one-dimensional finger print information.
4. the fingerprint base construction method of compressed webpage as claimed in claim 1, it is characterised in that step 4) passes through with reference to Pigeon Hole Principle builds fingerprint base with dictionary tree, and high speed rope is formed by the form that the cryptographic Hash of chain sheet form is improved to dictionary tree Draw.
5. a kind of compressed webpage similarity matching methods, it is characterised in that comprise the following steps:
1) Hofmann decoding is carried out to online Gzip compressed webpages, obtains half decompression webpage;
2) half-and-half decompression webpage carries out word segmentation processing, and chooses the participle that can characterize webpage as feature participle;
3) dimension-reduction treatment is carried out to the feature participle of webpage, generates one-dimensional finger print information;
4) fingerprint in the fingerprint base of the finger print information of the online webpage of generation and Gzip compressed webpages is subjected to similarity system design, And determined whether according to the similarity threshold of setting similar.
6. the similarity matching methods of compressed webpage as claimed in claim 5, it is characterised in that step 1) is according to harmful sites Gzip compressed webpages structure fingerprint base, as malice Sample Storehouse;Step 4) passes through by the finger print information of online webpage and maliciously Finger print information in Sample Storehouse carries out similarity system design, to judge whether online webpage is malicious web pages.
7. the similarity matching methods of the compressed webpage as described in claim 5 or 6, it is characterised in that step 1) passes through combination Piezomagnetic principle builds fingerprint base with dictionary tree, is formed at a high speed by the form that the cryptographic Hash of chain sheet form is improved to dictionary tree Index;When step 4) carries out similarity system design, the finger print information of online webpage is divided into by n blocks according to piezomagnetic principle, then by n Block is matched in Hash table respectively, finds corresponding cryptographic Hash, i.e. dictionary tree.
8. The similarity matching method for compressed webpages as claimed in claim 7, characterized in that step 4) determines similarity by calculating the Hamming distance between the hash values of the fingerprints; for the Hamming distance calculation in the dictionary tree, a min-heap is maintained to perform pruning so as to accelerate matching.
9. A server, characterized in that the server comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing each step of the method of any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a computer, implements the steps of the method of any one of claims 1 to 8.
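
As an illustration of the block-wise matching described in claims 7 and 8, the following is a minimal Python sketch and not the patented implementation. It assumes a 64-bit fingerprint split into n = 4 equal blocks; both values, and all function names, are hypothetical choices made for the example. By the pigeonhole principle, two fingerprints that differ in at most n - 1 bits must agree exactly on at least one of the n blocks, so exact block lookups produce a complete candidate set that is then filtered by Hamming distance. For brevity the sketch stores candidates in plain Python sets instead of a dictionary tree and checks every candidate instead of pruning with a min-heap.

```python
from collections import defaultdict

FP_BITS = 64   # assumed fingerprint width (not stated in the claims)
N_BLOCKS = 4   # assumed block count n; guarantees recall up to n - 1 differing bits


def split_blocks(fp, n_blocks=N_BLOCKS, bits=FP_BITS):
    """Split a fingerprint into n equal-width blocks, most significant block first."""
    width = bits // n_blocks
    mask = (1 << width) - 1
    return [(fp >> (bits - width * (i + 1))) & mask for i in range(n_blocks)]


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def build_index(fingerprints):
    """Map (block position, block value) to the set of fingerprints containing it.

    Plain sets stand in for the dictionary tree named in the claims.
    """
    index = defaultdict(set)
    for fp in fingerprints:
        for pos, block in enumerate(split_blocks(fp)):
            index[(pos, block)].add(fp)
    return index


def match(index, query, threshold=N_BLOCKS - 1):
    """Return (distance, fingerprint) pairs within `threshold` bits, nearest first.

    The result is only guaranteed complete when threshold <= N_BLOCKS - 1.
    """
    candidates = set()
    for pos, block in enumerate(split_blocks(query)):
        candidates |= index.get((pos, block), set())
    hits = [(hamming(query, fp), fp) for fp in candidates]
    return sorted((d, fp) for d, fp in hits if d <= threshold)
```

In use, `build_index` would be run once over the fingerprints of the sample library, and `match` would be called for each online webpage fingerprint; an empty result means no stored fingerprint lies within the threshold.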
CN201710742190.8A 2017-08-25 2017-08-25 A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage Pending CN107679073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710742190.8A CN107679073A (en) 2017-08-25 2017-08-25 A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710742190.8A CN107679073A (en) 2017-08-25 2017-08-25 A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage

Publications (1)

Publication Number Publication Date
CN107679073A true CN107679073A (en) 2018-02-09

Family

ID=61135370

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710742190.8A Pending CN107679073A (en) 2017-08-25 2017-08-25 A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage

Country Status (1)

Country Link
CN (1) CN107679073A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450452A (en) * 2018-11-27 2019-03-08 中国科学院计算技术研究所 A kind of compression method and system of the sampling dictionary tree index for gene data
CN110245314A (en) * 2019-05-31 2019-09-17 江苏百达智慧网络科技有限公司 A kind of web page fingerprint generation method
CN111899821A (en) * 2020-06-28 2020-11-06 广州万孚生物技术股份有限公司 Method for processing medical institution data, method and device for constructing database
CN112099725A (en) * 2019-06-17 2020-12-18 华为技术有限公司 Data processing method and device and computer readable storage medium
CN112788159A (en) * 2020-12-31 2021-05-11 山西三友和智慧信息技术股份有限公司 Webpage fingerprint tracking method based on DNS traffic and KNN algorithm
CN113518088A (en) * 2021-07-12 2021-10-19 北京百度网讯科技有限公司 Data processing method, device, server, client and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104079559A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN106372105A (en) * 2016-08-19 2017-02-01 中国科学院信息工程研究所 Spark platform-based microblog data preprocessing method
CN106980656A (en) * 2017-03-10 2017-07-25 北京大学 A kind of searching method based on two-value code dictionary tree

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fu Zhangjie et al.: "Privacy-Preserving Smart Similarity Search Based on Simhash over Encrypted Data in Cloud Computing", 《JOURNAL OF INTERNET TECHNOLOGY》 *
Yang Rong et al.: "Research on Simhash-Based Similarity Retrieval of Compressed Documents", 《HTTPS://MESALAB.CN/F/SCIENTIFICACHIEVEMENT/PAPERLIST?ACHIEVEPUBTIME=2016-01-01&TYPE=%E8%AE%BA%E6%96%87&SUBTYPE=&SELECTEDID=》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109450452A (en) * 2018-11-27 2019-03-08 中国科学院计算技术研究所 A kind of compression method and system of the sampling dictionary tree index for gene data
CN109450452B (en) * 2018-11-27 2020-07-10 中国科学院计算技术研究所 Compression method and system for sampling dictionary tree index aiming at gene data
CN110245314A (en) * 2019-05-31 2019-09-17 江苏百达智慧网络科技有限公司 A kind of web page fingerprint generation method
CN112099725A (en) * 2019-06-17 2020-12-18 华为技术有限公司 Data processing method and device and computer readable storage medium
US11797204B2 (en) 2019-06-17 2023-10-24 Huawei Technologies Co., Ltd. Data compression processing method and apparatus, and computer-readable storage medium
CN111899821A (en) * 2020-06-28 2020-11-06 广州万孚生物技术股份有限公司 Method for processing medical institution data, method and device for constructing database
CN112788159A (en) * 2020-12-31 2021-05-11 山西三友和智慧信息技术股份有限公司 Webpage fingerprint tracking method based on DNS traffic and KNN algorithm
CN112788159B (en) * 2020-12-31 2022-07-08 山西三友和智慧信息技术股份有限公司 Webpage fingerprint tracking method based on DNS traffic and KNN algorithm
CN113518088A (en) * 2021-07-12 2021-10-19 北京百度网讯科技有限公司 Data processing method, device, server, client and medium
CN113518088B (en) * 2021-07-12 2023-07-07 北京百度网讯科技有限公司 Data processing method, device, server, client and medium

Similar Documents

Publication Publication Date Title
CN107679073A (en) A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage
CN107516041B (en) WebShell detection method and system based on deep neural network
Mao et al. Phishing page detection via learning classifiers from page layout feature
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
US20210110039A1 (en) Real-time javascript classifier
CN109766693A (en) A kind of cross-site scripting attack detection method based on deep learning
Zhang et al. Webshell traffic detection with character-level features based on deep learning
CN106708952B (en) A kind of Webpage clustering method and device
US11336689B1 (en) Detecting phishing websites via a machine learning-based system using URL feature hashes, HTML encodings and embedded images of content pages
CN107341399A (en) Assess the method and device of code file security
US11438377B1 (en) Machine learning-based systems and methods of using URLs and HTML encodings for detecting phishing websites
US11444978B1 (en) Machine learning-based system for detecting phishing websites using the URLS, word encodings and images of content pages
CN112989348B (en) Attack detection method, model training method, device, server and storage medium
CN115080756A (en) Attack and defense behavior and space-time information extraction method oriented to threat information map
Yuan et al. A novel approach for malicious URL detection based on the joint model
CN103324886A (en) Method and system for extracting fingerprint database in network intrusion detection
Yu et al. Detecting malicious web requests using an enhanced textcnn
Yan et al. Cross-site scripting attack detection based on a modified convolution neural network
Benavides-Astudillo et al. Comparative Study of Deep Learning Algorithms in the Detection of Phishing Attacks Based on HTML and Text Obtained from Web Pages
Hu et al. Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism
CN112883373A (en) PHP type WebShell detection method and detection system thereof
CN113918936A (en) SQL injection attack detection method and device
US20230353595A1 (en) Content-based deep learning for inline phishing detection
Phung et al. Data augmentation of JavaScript dataset using DCGAN and random seed
CN112632549B (en) Web attack detection method based on context analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180209
