CN107679073A - Method for constructing a compressed-webpage fingerprint database and method for fast similarity matching of compressed webpages - Google Patents
- Publication number
- CN107679073A (application CN201710742190.8A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- compressed
- word segment
- fingerprint
- fingerprint database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Collating Specific Patterns (AREA)
Abstract
The present invention relates to a method for constructing a fingerprint database of compressed webpages and a method for fast similarity matching of compressed webpages. The method comprises: performing Huffman decoding on Gzip-compressed webpages to obtain half-decompressed webpages; segmenting the half-decompressed webpages into words and selecting the words that characterize each webpage as feature words; reducing the dimensionality of the feature words to generate a one-dimensional fingerprint; building a fingerprint database from the fingerprints of the webpages; and, for an online Gzip-compressed webpage, generating a one-dimensional fingerprint in the same way, comparing it with the fingerprints in the database, and deciding whether the pages are similar according to a set similarity threshold. Matching compressed webpages with this method effectively improves the efficiency of similarity matching.
Description
Technical field
The invention belongs to the field of network security and specifically relates to a fingerprint database construction method and a fast similarity matching method for compressed webpages.
Background art
At present there are large numbers of harmful websites at home and abroad, and they have an adverse effect on the network environment. Curbing the harm that crime associated with harmful websites causes to society and to the public has long been a concern of governments worldwide. Data from the advertising service provider DoubleClick (a Google subsidiary) show that dozens of the 500 highest-traffic websites are harmful adult sites. According to figures published by Business Insider in the United States, harmful websites account for 12% of all websites worldwide; on average, someone who visits 10 websites a day may well find that one of them is harmful. Business Value magazine reported that the world's largest adult website receives up to 4.4 billion visits a month, twice the traffic of JD.com and 11 times that of Youku. At the time of writing it ranked 34th on the Alexa list, far ahead of well-known sites such as the BBC and Tmall. In repeated Internet clean-up campaigns, operators and security software vendors have used a variety of methods to detect and shut down harmful websites (pornography, gambling, and trojan-hosting sites) and to cut off the sources from which illegal information spreads.
However, illegal websites evade detection by changing their dedicated servers, changing their domain names, using mirror proxies, and similar techniques, so that such harmful sites cannot be found by active probing or active crawling.
Discovering harmful websites online through passive traffic analysis is therefore of far-reaching importance for maintaining the network environment. Passive traffic analysis usually either inspects content with deep packet inspection or extracts a fingerprint of the page content with a hash algorithm such as MD5 or SHA-1 and then matches newly captured pages against those fingerprints. Real web traffic, however, contains many pages with variant or substituted words. For example, self-censorship by the server replaces sensitive words, and both methods fail. Different servers use different keyword filter lists and so generate different pages whose hash values also differ, and matching fails again. Pages fetched through a mirror proxy site, or fetched from the same server with different devices (such as a PC and a mobile phone) or at different times, also differ in layout or content; even a minor change in locally displayed information completely changes the fingerprint of the whole page. In these cases MD5 or SHA-1 hashing likewise cannot be used for matching.
Therefore, because keyword inspection and exact fingerprint matching both fail, passive traffic analysis needs similarity matching for webpages that contain substituted words.
Most website operators and applications compress the content they serve with Gzip, and Gzip-compressed transfer has become a basic part of HTTP/1.1. Both clients and servers support it. Browsers such as IE, Edge, Firefox, Chrome, Safari, Sogou Browser, and 360 Browser all accept pages transferred in Gzip format. On the server side, Nginx, Microsoft IIS, Apache, Tomcat, and others have built-in Gzip compression that can be enabled with simple configuration. In Nginx, for example, it is enough to write "gzip on" in conf/nginx.conf, and the file types to compress (CSS, JavaScript, HTML, and so on) can be selected. In terms of share of online traffic, 50% of the top 3000 sites in the Alexa ranking enable Gzip compression. The higher a site ranks, the more likely it is to enable Gzip; among the top 100 sites the figure reaches 65%. Gzip-compressed content likewise makes up a large share of real network flows. Measurements collected from data of the National Internet Emergency Center show that, for text-type data actually transmitted on the network, 65% of the traffic by volume and 66% by count is transferred in Gzip-compressed form; both shares exceed 60%.
Because Gzip-compressed pages make up most web traffic, passive traffic analysis must take into account the effect of Gzip compression on similar-page matching. But Gzip-compressed pages are encoded and hard to parse: a Gzip stream consists of a Gzip header followed by the output of the Deflate algorithm, and Deflate compresses the original text in two steps, LZ77 coding followed by Huffman coding (static or dynamic). Gzip decompression is also time-consuming and uses substantial computing resources.
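The container layout described above can be checked with a few lines of Python. This is a minimal sketch, not part of the patented method: the helper name split_gzip_member is hypothetical, and it assumes a Gzip member without optional header fields (FLG == 0), which is what Python's gzip.compress() produces.

```python
import gzip
import struct
import zlib


def split_gzip_member(blob: bytes):
    """Split a simple gzip member into (header, raw_deflate, trailer).

    Assumption: no optional header fields (FLG == 0). A general parser
    would have to honour the FLG bits (extra field, file name, comment).
    """
    magic, method, flags = blob[0:2], blob[2], blob[3]
    if magic != b"\x1f\x8b" or method != 8:
        raise ValueError("not a gzip/Deflate stream")
    if flags != 0:
        raise ValueError("optional header fields are not handled in this sketch")
    # 10-byte fixed header, Deflate payload, 8-byte trailer.
    return blob[:10], blob[10:-8], blob[-8:]


page = b"<html><body>" + b"hello gzip " * 20 + b"</body></html>"
header, payload, trailer = split_gzip_member(gzip.compress(page))

# The payload is a raw Deflate stream (LZ77 + Huffman); wbits=-15 tells
# zlib that there is no zlib or gzip wrapper around it.
restored = zlib.decompress(payload, wbits=-15)

# The trailer holds the CRC-32 and the uncompressed length, little-endian.
crc32, isize = struct.unpack("<II", trailer)
```

Full decompression, as done by zlib here, undoes both the Huffman and the LZ77 stage; the method described in this document stops after the Huffman stage.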
Gzip-compressed pages are hard to parse and slow to process, which conflicts with the need for speed in online inspection. A faster matching algorithm tailored to the characteristics of Gzip is therefore needed.
Summary of the invention
To solve the problem that similarity matching of Gzip-compressed webpages is slow, the present invention proposes a fast similarity matching method for compressed webpages that targets the Gzip compression algorithm and is suitable for intrusion detection systems in high-speed networks.
The technical solution adopted by the present invention is as follows:
A method for constructing a compressed-webpage fingerprint database, comprising the following steps:
1) performing Huffman decoding on Gzip-compressed webpages to obtain half-decompressed webpages;
2) segmenting the half-decompressed webpages into words and selecting the words that characterize each webpage as feature words;
3) reducing the dimensionality of the feature words of each webpage to generate a one-dimensional fingerprint;
4) building a fingerprint database from the fingerprints of the webpages.
Further, after the half-decompressed webpage is obtained in step 1), the pointers in it and the incomplete words caused by the pointers are removed.
Further, step 2) performs word segmentation with IKAnalyzer, computes the weights of all words with the TF-IDF algorithm, sorts the words by weight, and selects a certain number of them as the feature words that characterize the webpage.
Further, step 3) applies the Simhash algorithm to the feature words for dimensionality reduction, generating a one-dimensional fingerprint.
Further, step 4) builds the fingerprint database by combining the pigeonhole principle with a trie: the hash-table values, conventionally stored as linked lists, are replaced by tries to form a high-speed index.
A method for similarity matching of compressed webpages, comprising the following steps:
1) performing Huffman decoding on an online Gzip-compressed webpage to obtain a half-decompressed webpage;
2) segmenting the half-decompressed webpage into words and selecting the words that characterize the webpage as feature words;
3) reducing the dimensionality of the feature words of the webpage to generate a one-dimensional fingerprint;
4) comparing the fingerprint generated for the online webpage with the fingerprints in the fingerprint database of Gzip-compressed webpages, and deciding whether they are similar according to a set similarity threshold.
Further, step 1) builds the fingerprint database from the Gzip-compressed webpages of harmful websites, to serve as a malicious sample database; step 4) compares the fingerprint of the online webpage with the fingerprints in the malicious sample database to judge whether the online webpage is malicious.
Further, step 1) builds the fingerprint database by combining the pigeonhole principle with a trie, replacing linked-list hash values with tries to form a high-speed index; when step 4) performs the similarity comparison, the fingerprint of the online webpage is divided into n blocks according to the pigeonhole principle, each of the n blocks is looked up in the hash table, and the corresponding hash value, namely a trie, is found.
Further, step 4) determines similarity by computing the Hamming distance between fingerprints; for the Hamming-distance computation within the trie, a min-heap is maintained for pruning, to speed up matching.
A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing each step of the methods described above.
A computer-readable storage medium storing a computer program which, when executed by a computer, implements the steps of the methods described above.
The beneficial effects of the present invention are as follows:
The invention extracts features on the basis of half decompression rather than after full decompression, and combines a trie with the pigeonhole principle to build the high-speed index needed for similarity retrieval. Compared with existing methods based on full decompression, similarity matching of compressed webpages with this method is about 40% faster, which effectively improves matching efficiency.
Brief description of the drawings
Fig. 1 is the overall framework of the fast similarity matching method for compressed webpages.
Fig. 2 is a schematic diagram of the high-speed index.
Fig. 3 is a schematic diagram of the trie.
Detailed description of the embodiments
The present invention is described in further detail below with reference to specific embodiments and the drawings.
The core purpose of fast similarity matching of compressed webpages is to speed up the matching of malicious webpages in online network traffic and to reduce the time taken by the whole process. To this end, the invention optimizes the conventional similarity matching framework specifically for compressed webpages. The main design covers the following aspects:
1) Extraction of feature words in the half-decompressed state: a Gzip stream consists of the output of the Deflate algorithm plus a Gzip header and trailer; the Gzip payload is produced by Deflate. Deflate applies two steps in turn, LZ77 coding and Huffman coding. Extracting words on the basis of half decompression means working on the LZ77-coded data. LZ77 replaces repeated content with a pointer, a (distance, length) pair that refers back to the earlier occurrence. Word segmentation on the LZ77-coded data therefore first ignores the pointers, which does not affect the segmentation result because everything removed refers to repeated content. After this processing the input to the whole segmentation algorithm is much smaller, which reduces the overall time overhead.
2) High-speed index: the fingerprint generated for a webpage must be compared pairwise with the fingerprints in the malicious sample database, computing the Hamming distance to determine similarity. Comparing against every fingerprint one by one is too slow for an online system, so a high-speed index structure is needed. Here the pigeonhole principle is combined with a trie to speed up matching in the fingerprint database without significantly increasing memory use, so that similarity results are returned more quickly.
The overall framework of the invention is shown in Fig. 1. The dotted line divides it into two modules: the left side is the module that builds the malicious sample database and the right side is the module that processes online webpages. D denotes the Gzip-compressed webpages used to build the malicious sample database, R1 denotes an online Gzip-compressed webpage, C1~C4 denote the half-decompressed webpages obtained after Huffman decoding, F1~F4 denote the extracted feature strings, I1~I3 and T1 denote the fingerprints obtained with the Simhash algorithm, S denotes the resulting index, and A denotes the similar documents found by the similarity comparison.
First, the main steps of the module that builds the malicious sample database are:
1) Processing of the half-decompressed webpage: the file is first Huffman-decoded; then, on the resulting half-decompressed (LZ77-coded) data, the pointers ((distance, length) pairs) are skipped and the incomplete words are handled in turn.
2) Extraction of feature words: a fast Chinese word segmentation algorithm is chosen to segment the half-decompressed file. This embodiment uses the IKAnalyzer library for word segmentation. The weights of the words are then computed with the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, and the top 50 feature words are selected.
3) Feature dimensionality reduction: Simhash is computed over the extracted feature words, yielding a one-dimensional fingerprint.
4) Construction of the fingerprint database: the fingerprints of a large number of compressed webpages are obtained as in step 3) and form the fingerprint database. Here the conventional pigeonhole-principle approach is improved by replacing the linked-list hash values with tries, which speeds up matching among massive numbers of fingerprints while keeping space usage balanced.
Next, the main steps of the module that processes online webpages are:
1) Preprocessing and follow-up processing modules: suspicious webpages are selected by an earlier screening stage and then passed on to half decompression. Finally, if a highly similar webpage is found in the malicious sample database, the webpage is considered malicious and follow-up action can be taken; otherwise it is let through. This step can be implemented with existing techniques and is not the main subject of the invention, so it is not described in detail.
2) Processing of the half-decompressed webpage: same as step 1) of the sample database processing above;
3) Extraction of feature words: same as step 2) of the sample database processing above;
4) Feature dimensionality reduction: same as step 3) of the sample database processing above;
5) Retrieval in the fingerprint database: similarity is computed from the Hamming distance between the fingerprint of the online webpage and the fingerprints in the database. The fingerprint is divided into n blocks according to the pigeonhole principle, and each block is looked up in the hash table to find the corresponding hash value, namely a trie. Computing large numbers of Hamming distances within the trie requires maintaining a min-heap for pruning, to speed up matching.
Each of the above steps is described in detail below.
1. Processing of the half-decompressed webpage:
Both online compressed webpages and the compressed webpages in the malicious sample database start from this step, and its cost and output strongly affect the subsequent steps. The algorithm is as follows:
1) Reuse Huffman decompression; decompression is of two types, dynamic (adaptive) Huffman and static Huffman.
2) Obtain the LZ77-coded half-decompressed content and remove the pointers; incomplete words caused by the pointers are likewise removed so that they do not affect word segmentation. Because of how LZ77 works, almost all repeated content within the 32 KB window is removed, and the resulting file contains only plain-text content.
In comparative tests, this step reduces processing time by 20~60%; the range is wide because the saving depends strongly on the content of the original text. This saving shortens the overall processing pipeline. The half-decompressed file is also 15~60% smaller than the original document, again varying widely with the content of the original text. The smaller file serves as the input to later processing, which reduces subsequent overhead and speeds up processing.
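The pointer-skipping step can be illustrated on an already Huffman-decoded token stream. The following is a minimal sketch under stated assumptions, not the patent's implementation: the token representation (single-character strings for literals, (distance, length) tuples for pointers), the function name half_decompress, and the use of whitespace as the word boundary are all hypothetical choices made for illustration. Real Chinese text has no whitespace boundaries, so a real system would rely on its word segmenter instead.

```python
def half_decompress(tokens):
    """Keep only literal text from an LZ77 token stream.

    tokens: iterable of single-character strings (literals) and
            (distance, length) tuples (back-reference pointers).
    Pointers are skipped without being resolved. Any word fragment that
    touches a pointer is dropped, because it may be incomplete.
    """
    out = []                 # literal characters kept so far
    after_pointer = False    # True while skipping the fragment after a pointer
    for tok in tokens:
        if isinstance(tok, tuple):
            # Drop the partial word immediately before the pointer.
            while out and not out[-1].isspace():
                out.pop()
            after_pointer = True
        elif after_pointer:
            # Skip the partial word immediately after the pointer.
            if tok.isspace():
                after_pointer = False
                out.append(tok)
        else:
            out.append(tok)
    return "".join(out).split()


# "compress compression ratio": the second "compress" is a back-reference.
tokens = list("compress ") + [(9, 8)] + list("ion ratio")
words = half_decompress(tokens)
```

The sketch is deliberately conservative: a word that follows a pointer is dropped even if the pointer happened to end exactly on a word boundary, because that cannot be known without resolving the pointer.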
2. Extraction of feature words:
This module extracts the word vector that best characterizes the webpage:
1) To speed up online processing and meet the real-time demands of heavy traffic, the fastest word segmentation system should be chosen. After comparing several Chinese word segmenters, this embodiment chose IKAnalyzer;
2) To select the words that characterize the webpage, the weights of all words are computed and a small number are chosen after sorting. This embodiment uses the conventional TF-IDF algorithm to compute the weights and finally takes the top 50 words as the word vector that characterizes the webpage;
After this step, a vector of 50 words is obtained for each webpage. The selected vector serves as the feature that characterizes the webpage and is the input to subsequent processing.
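The weighting and top-N selection can be sketched as follows. This is an illustrative example only: IKAnalyzer is a Java library, so the sketch assumes the page has already been segmented into a list of words, and the smoothed IDF formula, the function name top_features, and the sample corpus are assumptions rather than details given in this document.

```python
import math
from collections import Counter


def top_features(doc_tokens, corpus, n=50):
    """Return the n highest-weighted (word, tf-idf) pairs of one document.

    doc_tokens: list of words of the document being fingerprinted.
    corpus:     list of sets, each holding the distinct words of one
                reference document (used for document frequency).
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        # Smoothed IDF (an assumption); avoids division by zero for unseen words.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[word] = (count / total) * idf
    # Sort by descending weight, then alphabetically to make ties deterministic.
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


corpus = [
    {"casino", "bonus", "login"},
    {"news", "login", "weather"},
    {"casino", "poker", "bonus"},
]
doc = "casino poker poker bonus login".split()
features = top_features(doc, corpus, n=2)
```

With n=50, as in this embodiment, the returned pairs would be the weighted feature words passed on to the Simhash step.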
3. Feature vector dimensionality reduction:
Once the word vectors are obtained, the similarity between webpages can be computed, but the conventional method, cosine similarity, consumes a large amount of computing resources. Dimensionality reduction is therefore applied here to speed up the similarity computation. The processing steps are as follows:
1) Hash: compute a hash value for each word extracted in the previous step, using a common algorithm such as MD5 or SHA-1;
2) Weight: in the binary representation of each hash value, weight every bit by the word's TF-IDF weight (in standard Simhash, a 1 bit contributes +w and a 0 bit contributes -w);
3) Merge: sum the weighted results position by position to obtain one merged value per bit position;
4) Reduce: map each value greater than 0 to 1 and each value less than or equal to 0 to 0. The resulting binary bit string is the fingerprint.
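The four steps above can be sketched in a few lines. This is a minimal illustration, not the embodiment's code: the 64-bit fingerprint length, the use of the first 8 bytes of the MD5 digest, and the function names are assumptions, and the +w/-w bit weighting follows the standard Simhash formulation.

```python
import hashlib


def simhash(weighted_words, bits=64):
    """Compute a Simhash fingerprint from (word, weight) pairs."""
    totals = [0.0] * bits
    for word, weight in weighted_words:
        # 1) Hash: take the leading bits of the MD5 digest as an integer.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # 2) Weight and 3) merge: add +w for a 1 bit, -w for a 0 bit.
        for i in range(bits):
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    # 4) Reduce: positive sums become 1, everything else becomes 0.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp1 = simhash([("casino", 0.6), ("poker", 0.5), ("bonus", 0.3)])
fp2 = simhash([("casino", 0.6), ("poker", 0.5), ("login", 0.3)])
distance = hamming(fp1, fp2)
```

Pages that share most of their heavily weighted words tend to produce fingerprints that differ in only a few bits, which is what makes a Hamming-distance threshold usable as a similarity test.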
4. Construction of the high-speed index:
Computing similarity requires pairwise comparison, but with a massive number of fingerprints, finding similar webpages takes a great deal of computation. A faster method is therefore needed. Here the pigeonhole principle and a trie are combined to build a faster index. The steps are as follows:
1) Let the threshold be k and divide the fingerprint evenly into k+1 data blocks (by the pigeonhole principle, two fingerprints that differ in at most k bits must agree exactly on at least one of the k+1 blocks);
2) Use the k+1 data blocks as keys of the hash table in Fig. 2. If a block has no corresponding key yet, it becomes a new key of the hash table; if the key already exists, no new key is created.
3) Append the fingerprint to the hash value of each of its keys. The hash value here is a trie; it is built in the same way as a conventional trie, which can be done with existing techniques and is therefore not described in detail.
Fig. 2 is a schematic diagram of the resulting high-speed index. Si denotes the fingerprint of a webpage used to build the index, and KEY1~KEY4 denote the k+1 data blocks into which the fingerprint is split. The content of the dotted box labelled "Hash key" is the keys of the hash table, with fingerprint blocks used as keys. The content of the dotted box labelled "Hash value" is the webpage fingerprints stored as tries; every fingerprint in a trie contains the data block of its key.
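The index construction can be sketched as a hash table whose values are binary tries. This is a hedged illustration: the nested-dict trie, the function names, and in particular keying the table by (block position, block bits) rather than by block bits alone are assumptions made here so that equal bit patterns at different positions do not collide; the document itself only says that the blocks are used as keys.

```python
from collections import defaultdict

BITS = 64  # assumed fingerprint length


def split_blocks(fp, k, bits=BITS):
    """Split a fingerprint into k+1 blocks of near-equal width, MSB first."""
    n = k + 1
    bitstr = format(fp, f"0{bits}b")
    base, extra = divmod(bits, n)
    blocks, pos = [], 0
    for i in range(n):
        width = base + (1 if i < extra else 0)
        blocks.append(bitstr[pos:pos + width])
        pos += width
    return blocks


def trie_insert(root, fp, doc_id, bits=BITS):
    """Insert a fingerprint into a binary trie stored as nested dicts."""
    node = root
    for bit in format(fp, f"0{bits}b"):
        node = node.setdefault(bit, {})
    node.setdefault("docs", []).append(doc_id)


def build_index(fingerprints, k):
    """Map every (position, block) key to a trie of the fingerprints containing it."""
    index = defaultdict(dict)
    for doc_id, fp in fingerprints.items():
        for pos, block in enumerate(split_blocks(fp, k)):
            trie_insert(index[(pos, block)], fp, doc_id)
    return index


samples = {"A": 0x0123456789ABCDEF, "B": 0x0123456789ABCDEE}
index = build_index(samples, k=3)
```

In this example the two fingerprints differ only in the last bit, so they share the first three block keys and land in the same three tries; only the fourth block produces two separate keys.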
5. Fingerprint matching:
Once the index is built, the fingerprint extracted from an online webpage is matched in the high-speed index, that is, the similarity comparison is performed. Pairwise comparison counts the number of bit positions in which two fingerprints differ, i.e. the Hamming distance. Given a threshold, two fingerprints are dissimilar if their Hamming distance exceeds the threshold and similar otherwise. The threshold is empirical, usually 2 or 3, and depends on the fingerprint length. The matching steps are as follows:
1) With threshold k, divide the fingerprint into k+1 data blocks.
2) Look up each of the k+1 data blocks in the hash table. If a key is found in the hash table, return the corresponding trie; otherwise end the match.
3) The trie is shown in Fig. 3. The binary string along the path from the root node to a leaf node is the Simhash value generated for one webpage. The final A-J denote the documents associated with the ends of the paths; because all fingerprints have the same length, these are the leaf nodes. The six "X" symbols in Fig. 3 mark the paths on which matching failed.
4) Matching in the trie maintains a min-heap, initially holding the two child nodes of the root; the key in the min-heap is the Hamming distance of the child node.
5) Each time, pop the top of the heap and run a depth-first search from it, at each step continuing with whichever of the two child nodes has the smaller Hamming distance, and pushing the other child back into the heap if its distance does not exceed the threshold k.
6) If, as the traversal continues, the Hamming distance exceeds that of the heap top, stop, push the node back into the heap, and continue the depth-first search from the new heap top.
7) Continue until a fingerprint within the threshold k is found or the whole trie has been traversed. A final match means that the malicious sample database contains a webpage highly similar to the online webpage, so the online webpage is considered malicious and can be passed to follow-up processing.
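The heap-based pruning can be sketched as a best-first search over a binary trie. Note that this is a simplified variant, not the exact procedure of steps 4) to 7): instead of alternating depth-first descent with heap re-insertion, it always expands the heap entry with the smallest accumulated Hamming distance and discards any branch whose distance already exceeds k. The nested-dict trie, the function names, and the 8-bit sample fingerprints are assumptions for illustration.

```python
import heapq
from itertools import count


def trie_insert(root, bits, doc_id):
    """Insert a fingerprint given as a '0'/'1' string into a nested-dict trie."""
    node = root
    for b in bits:
        node = node.setdefault(b, {})
    node.setdefault("docs", []).append(doc_id)


def search(root, query, k):
    """Return sorted (distance, doc_id) pairs for fingerprints within k bits of query."""
    tie = count()                      # tie-breaker so dict nodes are never compared
    heap = [(0, next(tie), 0, root)]   # (distance so far, tie, depth, node)
    hits = []
    while heap:
        dist, _, depth, node = heapq.heappop(heap)
        if depth == len(query):
            hits.extend((dist, doc) for doc in node.get("docs", []))
            continue
        for bit in "01":
            child = node.get(bit)
            if child is None:
                continue
            d = dist + (bit != query[depth])
            if d <= k:                 # prune branches already beyond the threshold
                heapq.heappush(heap, (d, next(tie), depth + 1, child))
    return sorted(hits)


root = {}
trie_insert(root, "10110010", "A")
trie_insert(root, "10110011", "B")
trie_insert(root, "01001101", "C")
matches = search(root, "10110010", k=2)
```

Here documents A and B are returned with distances 0 and 1, while C, which differs from the query in every bit, is pruned near the root without its path being followed to the end.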
The above embodiments merely illustrate the technical solution of the invention and do not limit it. A person of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the invention; the scope of protection of the invention is defined by the claims.
Claims (10)
1. A method for constructing a compressed-webpage fingerprint database, characterized by comprising the following steps:
1) performing Huffman decoding on Gzip-compressed webpages to obtain half-decompressed webpages;
2) segmenting the half-decompressed webpages into words and selecting the words that characterize each webpage as feature words;
3) reducing the dimensionality of the feature words of each webpage to generate a one-dimensional fingerprint;
4) building a fingerprint database from the fingerprints of the webpages.
2. The method for constructing a compressed-webpage fingerprint database according to claim 1, characterized in that, after the half-decompressed webpage is obtained in step 1), the pointers in it and the incomplete words caused by the pointers are removed.
3. The method for constructing a compressed-webpage fingerprint database according to claim 1, characterized in that step 2) performs word segmentation with IKAnalyzer, computes the weights of all words with the TF-IDF algorithm, sorts the words by weight, and selects a certain number of them as the feature words that characterize the webpage; and step 3) applies the Simhash algorithm to the feature words for dimensionality reduction, generating a one-dimensional fingerprint.
4. The method for constructing a compressed-webpage fingerprint database according to claim 1, characterized in that step 4) builds the fingerprint database by combining the pigeonhole principle with a trie, replacing linked-list hash values with tries to form a high-speed index.
5. A method for similarity matching of compressed webpages, characterized by comprising the following steps:
1) performing Huffman decoding on an online Gzip-compressed webpage to obtain a half-decompressed webpage;
2) segmenting the half-decompressed webpage into words and selecting the words that characterize the webpage as feature words;
3) reducing the dimensionality of the feature words of the webpage to generate a one-dimensional fingerprint;
4) comparing the fingerprint generated for the online webpage with the fingerprints in the fingerprint database of Gzip-compressed webpages, and deciding whether they are similar according to a set similarity threshold.
6. The method for similarity matching of compressed webpages according to claim 5, characterized in that step 1) builds the fingerprint database from the Gzip-compressed webpages of harmful websites, to serve as a malicious sample database; and step 4) compares the fingerprint of the online webpage with the fingerprints in the malicious sample database to judge whether the online webpage is malicious.
7. The method for similarity matching of compressed webpages according to claim 5 or 6, characterized in that step 1) builds the fingerprint database by combining the pigeonhole principle with a trie, replacing linked-list hash values with tries to form a high-speed index; and when step 4) performs the similarity comparison, the fingerprint of the online webpage is divided into n blocks according to the pigeonhole principle, each of the n blocks is looked up in the hash table, and the corresponding hash value, namely a trie, is found.
8. The method for similarity matching of compressed webpages according to claim 7, characterized in that step 4) determines similarity by computing the Hamming distance between fingerprints; and for the Hamming-distance computation within the trie, a min-heap is maintained for pruning, to speed up matching.
9. A server, characterized in that the server comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing each step of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a computer, implements the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710742190.8A CN107679073A (en) | 2017-08-25 | 2017-08-25 | A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679073A true CN107679073A (en) | 2018-02-09 |
Family
ID=61135370
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710742190.8A Pending CN107679073A (en) | 2017-08-25 | 2017-08-25 | A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679073A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104079559A (en) * | 2014-06-05 | 2014-10-01 | 腾讯科技(深圳)有限公司 | Web address security detecting method and device and server |
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN106980656A (en) * | 2017-03-10 | 2017-07-25 | 北京大学 | A kind of searching method based on two-value code dictionary tree |
Non-Patent Citations (3)
Title |
---|
付章杰 et al.: "Privacy-Preserving Smart Similarity Search Based on Simhash over Encrypted Data in Cloud Computing", 《JOURNAL OF INTERNET TECHNOLOGY》 * |
杨嵘 et al.: "基于Simhash的压缩文档相似性检索研究" (Research on Simhash-based similarity retrieval of compressed documents), 《HTTPS://MESALAB.CN/F/SCIENTIFICACHIEVEMENT/PAPERLIST?ACHIEVEPUBTIME=2016-01-01&TYPE=%E8%AE%BA%E6%96%87&SUBTYPE=&SELECTEDID=》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109450452A (en) * | 2018-11-27 | 2019-03-08 | 中国科学院计算技术研究所 | A kind of compression method and system of the sampling dictionary tree index for gene data |
CN109450452B (en) * | 2018-11-27 | 2020-07-10 | 中国科学院计算技术研究所 | Compression method and system for sampling dictionary tree index aiming at gene data |
CN110245314A (en) * | 2019-05-31 | 2019-09-17 | 江苏百达智慧网络科技有限公司 | A kind of web page fingerprint generation method |
CN112099725A (en) * | 2019-06-17 | 2020-12-18 | 华为技术有限公司 | Data processing method and device and computer readable storage medium |
US11797204B2 (en) | 2019-06-17 | 2023-10-24 | Huawei Technologies Co., Ltd. | Data compression processing method and apparatus, and computer-readable storage medium |
CN111899821A (en) * | 2020-06-28 | 2020-11-06 | 广州万孚生物技术股份有限公司 | Method for processing medical institution data, method and device for constructing database |
CN112788159A (en) * | 2020-12-31 | 2021-05-11 | 山西三友和智慧信息技术股份有限公司 | Webpage fingerprint tracking method based on DNS traffic and KNN algorithm |
CN112788159B (en) * | 2020-12-31 | 2022-07-08 | 山西三友和智慧信息技术股份有限公司 | Webpage fingerprint tracking method based on DNS traffic and KNN algorithm |
CN113518088A (en) * | 2021-07-12 | 2021-10-19 | 北京百度网讯科技有限公司 | Data processing method, device, server, client and medium |
CN113518088B (en) * | 2021-07-12 | 2023-07-07 | 北京百度网讯科技有限公司 | Data processing method, device, server, client and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107679073A (en) | A kind of quick similarity matching methods of compressed webpage fingerprint base construction method and compressed webpage | |
CN107516041B (en) | WebShell detection method and system based on deep neural network | |
Mao et al. | Phishing page detection via learning classifiers from page layout feature | |
CN108737423B (en) | Phishing website discovery method and system based on webpage key content similarity analysis | |
US20210110039A1 (en) | Real-time javascript classifier | |
CN109766693A (en) | A kind of cross-site scripting attack detection method based on deep learning | |
Zhang et al. | Webshell traffic detection with character-level features based on deep learning | |
CN106708952B (en) | A kind of Webpage clustering method and device | |
US11336689B1 (en) | Detecting phishing websites via a machine learning-based system using URL feature hashes, HTML encodings and embedded images of content pages | |
CN107341399A (en) | Assess the method and device of code file security | |
US11438377B1 (en) | Machine learning-based systems and methods of using URLs and HTML encodings for detecting phishing websites | |
US11444978B1 (en) | Machine learning-based system for detecting phishing websites using the URLS, word encodings and images of content pages | |
CN112989348B (en) | Attack detection method, model training method, device, server and storage medium | |
CN115080756A (en) | Attack and defense behavior and space-time information extraction method oriented to threat information map | |
Yuan et al. | A novel approach for malicious URL detection based on the joint model | |
CN103324886A (en) | Method and system for extracting fingerprint database in network intrusion detection | |
Yu et al. | Detecting malicious web requests using an enhanced textcnn | |
Yan et al. | Cross-site scripting attack detection based on a modified convolution neural network | |
Benavides-Astudillo et al. | Comparative Study of Deep Learning Algorithms in the Detection of Phishing Attacks Based on HTML and Text Obtained from Web Pages | |
Hu et al. | Cross-site scripting detection with two-channel feature fusion embedded in self-attention mechanism | |
CN112883373A (en) | PHP type WebShell detection method and detection system thereof | |
CN113918936A (en) | SQL injection attack detection method and device | |
US20230353595A1 (en) | Content-based deep learning for inline phishing detection | |
Phung et al. | Data augmentation of JavaScript dataset using DCGAN and random seed | |
CN112632549B (en) | Web attack detection method based on context analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20180209 |