
CN102629261A - Method for finding landing page from phishing page - Google Patents

Method for finding landing page from phishing page

Publication number: CN102629261A
Authority: CN
Grant status: Application
Prior art keywords: web, page, phishing, pages, method
Application number: CN 201210051171
Other languages: Chinese (zh)
Other versions: CN102629261B (en)
Inventors: 周国富, 周国强, 张卫丰, 张迎周, 王慕妮, 田先桃, 许碧欢, 陆柳敏, 顾赛赛
Original assignee: 南京邮电大学

Abstract

The invention provides a method for finding the target page of a phishing web page, that is, the legitimate page that the phishing page imitates. First, keywords are extracted from the page text and the page images to form the lexical signature of the phishing page. The lexical signature is then searched on several search engines, and the results of those engines are combined to find the K most relevant web pages. These K pages and the phishing page are saved as images, and an image perceptual hash sequence is extracted from each. Finally, the Hamming distance between each of the K page images and the phishing page image is calculated, and one or more legitimate pages imitated by the phishing page are selected according to these distances.

Description

Method for finding the target page from a phishing web page

Technical field

[0001] The present invention relates to a method for finding the target page from a phishing web page. It locates the target page mainly through the similarity in text and image features between a phishing page and its corresponding target page, in order to update the whitelist required for phishing detection. It belongs to the field of information security.


Background

[0003] Phishing websites are a form of online fraud that has become rampant with the spread of the Internet and the growth of online transactions. A phishing website is a fraudulent site that criminals build by imitating a legitimate web page. It is usually almost identical to a bank website or another well-known website, so as to lure users into submitting sensitive information such as user names, passwords, bank account numbers, or credit card details.

[0004] A typical phishing attack proceeds as follows. The user is first lured to a carefully designed phishing website that closely resembles the website of the target organization; the attacker then captures the sensitive personal information the user enters there, such as bank account numbers and passwords. The attack usually does not alert the victim. Such information is highly attractive to the operators of phishing sites: with it they can impersonate victims in fraudulent financial transactions and gain substantial profits, while the victims suffer heavy financial losses. The stolen information may also be used in other illegal activities. Identifying phishing websites and protecting the confidentiality and integrity of the information that websites transmit are therefore increasingly important and necessary.

[0005] Most users are deceived because a phishing page is usually highly similar to the genuine page, so detecting phishing pages from the standpoint of similarity is a promising approach. In phishing detection, however, the quality of the feature library affects detection accuracy just as directly as the detection method itself. How to find the target page of a phishing page is therefore the focus of the present invention. In 2007 Zhang et al. proposed CANTINA [Zhang2007], which detects phishing pages with the help of a third-party tool, namely a search engine. It first computes the TF-IDF of the words in the page (TF-IDF, term frequency-inverse document frequency, is a statistical measure of how important a word is to one document in a document collection or corpus) and submits the few top-ranked terms to a search engine; if the page does not appear among the first 30 results, it is judged to be a phishing page. The method achieves high accuracy and a low false-positive rate, but it relies only on page content, so it fails for pages that have little text and many images, or whose text is embedded in images.
The present invention extracts keywords from both text and images, searches for them on several search engines, combines the results of those engines, and finally uses image perceptual hashing to find the most similar target page from the image standpoint.

[0006] [Zhang2007] Y. Zhang, J. Hong, and L. Cranor. Cantina: A content-based approach to detecting phishing websites. WWW, 2007.

[Fu2006] Anthony Y. Fu, Wenyin Liu, Xiaotie Deng. Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD). IEEE Transactions on Dependable and Secure Computing, 2006, 3(4), pages 301-311.

[Dong2010] X. Dong, J. A. Clark, J. L. Jacob. Defending the weakest link: phishing websites detection by analysing user behaviours. Springer Science+Business Media, LLC, 2010.

[Cao2009] Jiuxin Cao, Bo Mao, Junzhou Luo, and Bo Liu. A Phishing Web Pages Detection Algorithm Based on Nested Structure of Earth Mover's Distance (Nested-EMD). Chinese Journal of Computers, 2009, (05): 922-929.

[Chen2009] K.-T. Chen, J.-Y. Chen, C.-R. Huang, and C.-S. Chen. Fighting Phishing with Discriminative Keypoint Features of Webpages. IEEE Internet Computing, 2009.

[Afroz2009] Sadia Afroz and Rachel Greenstadt. Phishzoo: An Automated Web Phishing Detection Approach Based on Profiling and Fuzzy Matching. Technical Report DU-CS-09-03, Drexel University, 2009.

[Henzinger2006] M. Henzinger. Finding near-duplicate Web pages: A large-scale evaluation of algorithms. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.

Summary of the invention

[0007] Technical problem: The method proposed by the present invention for finding the target page from a phishing web page combines the text and image features of web pages and relies on third-party tools and image perceptual hashing to locate the target page. To win users' trust, phishers usually build phishing pages by imitating legitimate pages, so a phishing page is visually very similar to its target page and the two are closely related. In the past, target pages were identified manually. The proposed method finds the target page from the standpoint of similarity, which is closer to the actual situation and helps maintain high accuracy and a low false-positive rate in phishing detection.

[0008] Technical solution: Most users are deceived because phishing pages are highly similar to genuine pages, so detecting phishing pages from the standpoint of similarity is a promising approach. In phishing detection, however, the quality of the feature library affects detection accuracy just as directly as the detection method, and the focus of the present invention is how to find the target page of a phishing page. Once the target page closest to a phishing page has been found, later phishing pages aimed at the same target page can be detected reliably, which improves detection accuracy.

[0009] In the method for finding the target page from a phishing web page, keywords are first extracted from the page title, the page body, and the page images to form the lexical signature of the phishing page. The lexical signature is then searched on several search engines, and the results of these engines are combined to find the K closest web pages, where K is an integer. These K pages and the phishing page are saved as images, and an image perceptual hash sequence is extracted from each. Finally, the Hamming distance between each of the K page images and the phishing page image is calculated, and one or more target pages of the phishing page are selected according to the distances.

[0010] The method mainly comprises three parts: generation of the lexical signature, retrieval on multiple search engines, and generation and matching of image perceptual hash sequences.

[0011] The lexical signature generation part requires the following steps:

Step 11) extract the plain text from the page title and the page body separately;

Step 12) obtain the images in the page and extract the text embedded in them by optical character recognition (OCR);

Step 13) combine the text from the page title, body, and images, calculate the term frequency-inverse document frequency (TF-IDF) value of each word, and form a lexical signature from the 5 words with the highest TF-IDF values;

The multiple search engine retrieval part requires the following steps:

Step 21) search for the generated lexical signature on each of N search engines, where N is an integer;

Step 22) find the web pages that appear in the results of at least two search engines and form a web page list;

Step 23) calculate the relevance of each web page in the list by formulas 1, 2, and 3;

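$$\begin{bmatrix} u_{1,1} & u_{2,1} & \cdots & u_{N,1} \\ u_{1,2} & u_{2,2} & \cdots & u_{N,2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1,N_r} & u_{2,N_r} & \cdots & u_{N,N_r} \end{bmatrix} \qquad (1)$$

where $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, and $N$ and $N_r$ are integers;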

$$w_{i,j} = \begin{cases} \dfrac{N_r - (R_{i,j} - 1)}{N_r}, & u_{i,j} = u_p \\ 0, & \text{else} \end{cases} \qquad (2)$$

where $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $N_r$ is the total number of search results taken from each search engine; $R_{i,j}$ is the rank of the $j$-th result in the $i$-th search engine, $R_{i,j} = j$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, and if $u_{i,j}$ appears in only one search engine then $w_{i,j} = 0$; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, where $M$ is an integer and $M < N \cdot N_r$;

$$S_p = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ u_{i,j} = u_p}}^{N_r} w_{i,j}, \qquad p = 1, 2, \ldots, M \qquad (3)$$

where $S_p$ is the sum of the relevance of $u_p$ over the $N$ search engines; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, where $M$ is an integer and $M < N \cdot N_r$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, and if $u_{i,j}$ appears in only one search engine then $w_{i,j} = 0$; $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $N$ and $N_r$ are integers;

Step 24) determine the K web pages with the highest relevance from formulas 3 and 4; these K pages are regarded as the most relevant to the phishing page and serve as its candidate target pages, where K is an integer not greater than $N \cdot N_r$;

$$R_p = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ u_{i,j} = u_p}}^{N_r} R_{i,j} \qquad (4)$$

where $R_p$ is the sum of the ranks of $u_p$ in the $N$ search engines; $u_p$ is a web page that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, $M < N \cdot N_r$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine; $R_{i,j}$ is the rank of the $j$-th result in the $i$-th search engine, $R_{i,j} = j$;

The generation and matching part of the image perceptual hash sequence requires the following steps:

Step 31) normalize the images: convert every image to a 255-level grayscale image and use bilinear interpolation to unify the resolution to m*m, where m is an integer multiple of 8;

Step 32) divide the m*m image into 8*8 blocks;

Step 33) apply the discrete cosine transform to each block; for each block, keep 1 DC component and 9 AC components and set the rest to 0;

Step 34) process the newly generated discrete cosine coefficient matrix with a visual model to remove redundant data and improve the efficiency of image compression;

Step 35) encrypt using the Logistic equation as a chaotic sequence generator: generate an encryption matrix from a key and encrypt the discrete cosine transform coefficient matrix with this matrix;

Step 36) convert the resulting floating-point data to binary data by quantization, reducing redundancy;

Step 37) apply Huffman coding to obtain the final hash sequence;

Step 38) calculate the Hamming distance between the hash sequence of the phishing page image and the hash sequence of each of the K candidate page images, and select the L pages with the smallest distances as the legitimate pages imitated by the phishing page, where L is an integer not greater than K.

[0012] Beneficial effects: The method of the present invention combines third-party tools with image perceptual hashing and uses the similarity of web pages in both text and images to find the target page corresponding to a phishing page. Using the method to collect and update the whitelist required for phishing detection helps improve the accuracy of phishing page detection.


Brief description of the drawings

[0014] Figure 1 is the overall framework diagram of the method of the present invention.

Figure 2 is the flowchart of image perceptual hash sequence generation.


Detailed description

[0016] The object of the present invention is to provide a method for finding the target page from a phishing web page. First, keywords are extracted from a known phishing page to form a lexical signature. Second, the lexical signature is searched on several search engines, the results of these engines are combined, and the most relevant pages are selected as candidates. The candidate pages are then saved as images, an image perceptual hash sequence is extracted from each, and the Hamming distance between each candidate page image and the phishing page image is calculated; one or more target pages of the phishing page are selected according to the distances.

[0017] Finding the target page from a phishing web page requires the following steps:

Step 1) extract the text from the page title, the page body, and the page images, combine it, calculate the TF-IDF value of each word, and form a lexical signature from the 5 words with the highest TF-IDF values;

Step 2) search for the lexical signature generated in step 1) on each of N search engines, such as Google and Yahoo;

Step 3) take the top $N_r$ results from each search engine to form a web page list, and calculate the relevance of each page in the list;

Step 4) select the K pages with the highest relevance, and save the phishing page and the K selected pages as images;

Step 5) extract the hash sequence of each image by image perceptual hashing;

Step 6) calculate the Hamming distance between the hash sequence of each of the K page images and the hash sequence of the phishing page image, and select the L pages with the smallest distances as the target pages of the phishing page.

[0018] The overall framework of the method is shown in Figure 1. The method can be divided into three parts: generation of the lexical signature, retrieval on multiple search engines, and generation and matching of image perceptual hash sequences.

[0019] 1. Lexical signature generation

The lexical signature of the present invention consists of the keywords with the highest TF-IDF values in the web page. The keywords come from three sources: the plain text of the page title, the plain text of the page body, and the text embedded in the page images, which is extracted by optical character recognition. Drawing keywords from all three sources reduces the errors that arise when a page has little plain text, many images, or text embedded in images.

[0020] The specific steps are as follows:

Step 11) extract the plain text from the page title and the page body separately;

Step 12) obtain the images in the page and extract the text embedded in them by optical character recognition;

Step 13) combine the text from the page title, body, and images, calculate the TF-IDF value of each word, and form a lexical signature from the 5 words with the highest TF-IDF values.
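Steps 11) to 13) can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the implementation of the invention: the tokenizer, the reference corpus used to estimate document frequencies, and the names `tokenize` and `lexical_signature` are hypothetical, and `image_text` stands for whatever text the OCR step returns.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the source names no tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(title, body, image_text, corpus, size=5):
    """Return the `size` terms with the highest TF-IDF in the combined page text.

    `corpus` is a list of reference documents used only to estimate document
    frequencies (an assumption; the source does not specify the corpus).
    """
    terms = tokenize(" ".join([title, body, image_text]))
    if not terms:
        return []
    tf = Counter(terms)
    doc_sets = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(doc_sets) + 1  # the page itself counts as one document
    scores = {}
    for term, count in tf.items():
        df = 1 + sum(term in d for d in doc_sets)  # the page contains the term
        scores[term] = (count / len(terms)) * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]
```

Terms that occur in every reference document receive a TF-IDF of zero, so common words drop out of the signature without a separate stop-word list.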

[0021] 2. Retrieval on multiple search engines

The lexical signature generated from the phishing page is searched on each of N search engines, such as Google and Yahoo. The top $N_r$ results of each search engine are taken and represented by formula 1.

[0022]

$$\begin{bmatrix} u_{1,1} & u_{2,1} & \cdots & u_{N,1} \\ u_{1,2} & u_{2,2} & \cdots & u_{N,2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1,N_r} & u_{2,N_r} & \cdots & u_{N,N_r} \end{bmatrix} \qquad (1)$$

where $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, and $N$ and $N_r$ are integers.

[0023] The web pages that appear in the results of at least two search engines are found and labeled $u_p$, $p = 1, 2, \ldots, M$, $M < N \cdot N_r$.

[0024] The relevance is calculated according to formula 2:

$$w_{i,j} = \begin{cases} \dfrac{N_r - (R_{i,j} - 1)}{N_r}, & u_{i,j} = u_p \\ 0, & \text{else} \end{cases} \qquad (2)$$

where $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $N_r$ is the total number of search results taken from each search engine; $R_{i,j}$ is the rank of the $j$-th result in the $i$-th search engine, $R_{i,j} = j$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, where $N$ and $N_r$ are integers, and if $u_{i,j}$ appears in only one search engine then $w_{i,j} = 0$; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, where $M$ is an integer and $M < N \cdot N_r$.

[0025] The sum of the relevance of $u_p$ over the $N$ search engines, $S_p$, is calculated according to formula 3:

[0026]

$$S_p = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ u_{i,j} = u_p}}^{N_r} w_{i,j}, \qquad p = 1, 2, \ldots, M \qquad (3)$$

where $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, where $N$ and $N_r$ are integers; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, where $M$ is an integer and $M < N \cdot N_r$.

[0027] If several web pages have equal $S_p$, or several web pages attain the maximum $S_p$ at the same time, the sum of their ranks in the search engines, $R_p$, is calculated for each of them according to formula 4.

[0028]

$$R_p = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ u_{i,j} = u_p}}^{N_r} R_{i,j} \qquad (4)$$

where $R_{i,j}$ is the rank of the $j$-th result in the $i$-th search engine, $R_{i,j} = j$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, where $N$ and $N_r$ are integers; $u_p$ is a web page that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, $M < N \cdot N_r$.

[0029] The K web pages with the highest relevance are selected according to formulas 3 and 4 and saved as images, and the hash sequences of these page images and of the phishing page image are extracted.

[0030] The multiple search engine retrieval part requires the following specific steps:

Step 21) search for the generated lexical signature on each of N search engines, such as Google and Yahoo;

Step 22) find the web pages that appear in the results of at least two search engines and form a web page list;

Step 23) calculate the relevance of each web page in the list by formulas 1, 2, and 3;

Step 24) find the K web pages with the highest relevance by formulas 3 and 4, and take these K pages as the candidate target pages of the phishing page.
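Steps 22) to 24) can be illustrated with the short Python sketch below. It assumes that the per-engine result lists have already been fetched and that every engine contributes the same number of results. The weight `(n_r - (rank - 1)) / n_r` is one reading of formula 2, and ordering tied pages by the smaller rank sum is likewise an assumption, since the text does not state the direction of the tie-break. The function name `top_k_candidates` is hypothetical.

```python
from collections import defaultdict


def top_k_candidates(results, k):
    """Rank URLs returned by several search engines.

    `results[i][j]` is the URL ranked j+1 by engine i (the layout of formula 1).
    """
    n_r = len(results[0])
    engines_seen = defaultdict(set)
    for i, urls in enumerate(results):
        for url in urls:
            engines_seen[url].add(i)
    # Keep only URLs that appear in at least two engines (the u_p).
    shared = {u for u, seen in engines_seen.items() if len(seen) >= 2}

    relevance = defaultdict(float)  # S_p of formula 3
    rank_sum = defaultdict(int)     # R_p of formula 4
    for urls in results:
        for j, url in enumerate(urls):
            if url in shared:
                rank = j + 1  # R_ij = j
                # Assumed form of the weight w_ij in formula 2.
                relevance[url] += (n_r - (rank - 1)) / n_r
                rank_sum[url] += rank
    # Assumption: on equal S_p the smaller rank sum comes first.
    ordered = sorted(shared, key=lambda u: (-relevance[u], rank_sum[u], u))
    return ordered[:k]
```

With three engines returning `a b c d`, `b a e c`, and `f b g h`, page `b` scores 2.5, page `a` scores 1.75, and page `c` scores 0.75, so `b` is the first candidate.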

[0031] 3. Matching based on image perceptual hashing

Figure 2 is the flowchart of image perceptual hash sequence generation. The work of each module in the flowchart is briefly described below.

[0032] The web page is first saved as an image, and the image is then normalized: all images are converted to 255-level grayscale images, and bilinear interpolation is used to unify the resolution to m*m, where m is usually chosen as an integer multiple of 8, so that the hash sequences finally generated have a uniform length.
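The normalization step can be sketched in pure Python as below. The sketch assumes the page screenshot is already available as rows of (r, g, b) tuples. The ITU-R BT.601 luma weights and the corner-aligned sampling grid are illustrative choices that the text does not prescribe, and the names `to_gray` and `resize_bilinear` are hypothetical.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to gray levels in 0..255.

    The BT.601 weights are an assumption; the source names no conversion.
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_rows]


def resize_bilinear(gray, m):
    """Resize a 2-D list of gray levels to m x m with bilinear interpolation."""
    if m % 8 != 0:
        raise ValueError("m must be a multiple of 8")
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(m):
        # Map the output row onto the source grid (corner-aligned mapping).
        fy = y * (h - 1) / (m - 1)
        y0 = int(fy)
        y1 = min(y0 + 1, h - 1)
        dy = fy - y0
        row = []
        for x in range(m):
            fx = x * (w - 1) / (m - 1)
            x0 = int(fx)
            x1 = min(x0 + 1, w - 1)
            dx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = gray[y0][x0] * (1 - dx) + gray[y0][x1] * dx
            bottom = gray[y1][x0] * (1 - dx) + gray[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out
```

Because every page image is brought to the same m*m size, each image yields the same number of 8*8 blocks in the following step.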

[0033] The discrete cosine transform of the image proceeds as follows: the m*m image is first divided into 8*8 blocks, the discrete cosine transform is applied to each block, and for each block 1 DC component (DC coefficient) and 9 AC components (AC coefficients) are kept while the rest are set to 0. The newly generated discrete cosine transform coefficient matrix is then processed with a visual model, which effectively removes redundant data and improves the efficiency of image compression.
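A minimal sketch of the block transform and coefficient selection follows. The text does not say which 9 AC coefficients are kept, so the sketch assumes the first 9 in JPEG zigzag order, that is, the lowest frequencies. The function names are hypothetical, and the visual-model processing is not shown.

```python
import math


def dct_8x8(block):
    """Type-II 2-D DCT of an 8x8 block, with the scaling used in JPEG."""
    def c(k):
        return math.sqrt(0.5) if k == 0 else 1.0

    out = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            total = 0.0
            for x in range(8):
                for y in range(8):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * total
    return out


def zigzag_order(n=8):
    """Index pairs of an n x n block in JPEG zigzag order."""
    return sorted(((u, v) for u in range(n) for v in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else p[1]))


def keep_low_frequencies(coeffs, n_ac=9):
    """Keep the DC coefficient and the first `n_ac` AC coefficients.

    Selecting them in zigzag order is an assumption, not stated in the source.
    """
    keep = set(zigzag_order()[:1 + n_ac])
    return [[coeffs[u][v] if (u, v) in keep else 0.0 for v in range(8)]
            for u in range(8)]
```

Under this assumption the 10 retained coefficients are exactly those whose row and column indices sum to at most 3.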

[0034] Encryption normalizes the matrix. Exploiting the non-repeating iteration and the sensitivity to initial values of data in the chaotic region, the Logistic equation is used as a chaotic sequence generator: an encryption matrix is generated from a key and used to encrypt the discrete cosine transform coefficient matrix, which ensures the security of the hash function.
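The key-dependent matrix can be illustrated with the logistic map x -> mu * x * (1 - x). The text does not specify how the matrix is applied to the coefficients, so the sketch below assumes a matrix of +1/-1 signs multiplied element by element. The parameter `mu = 3.99`, the burn-in of 100 iterations, the 0.5 threshold, and all function names are hypothetical choices.

```python
def logistic_sequence(key, length, mu=3.99, burn_in=100):
    """Iterate x -> mu * x * (1 - x) starting from the seed `key`, 0 < key < 1."""
    if not 0.0 < key < 1.0:
        raise ValueError("key must lie strictly between 0 and 1")
    x = key
    for _ in range(burn_in):  # discard the initial transient
        x = mu * x * (1.0 - x)
    out = []
    for _ in range(length):
        x = mu * x * (1.0 - x)
        out.append(x)
    return out


def key_matrix(key, size):
    """Build a size x size matrix of +1/-1 signs from the chaotic sequence."""
    seq = logistic_sequence(key, size * size)
    return [[1.0 if seq[r * size + c] >= 0.5 else -1.0 for c in range(size)]
            for r in range(size)]


def encrypt(coeffs, key):
    """Element-wise product of a square coefficient matrix with the key matrix.

    This sign-mask scheme is an assumption, not the scheme of the source.
    """
    mask = key_matrix(key, len(coeffs))
    return [[a * b for a, b in zip(row, mrow)]
            for row, mrow in zip(coeffs, mask)]
```

Because the map is sensitive to its initial value, two keys that differ slightly produce different sign matrices, so a hash cannot be reproduced without the key.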

[0035] Quantization converts the floating-point data to binary data, which reduces redundancy and simplifies storage. Finally, Huffman coding is applied to obtain the final hash sequence.
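One simple way to quantize floating-point coefficients into bits is to threshold them, as sketched below. The text does not give the quantization rule, so the median threshold and the name `binarize` are assumptions. The Huffman coding step is not shown.

```python
from statistics import median


def binarize(values):
    """Map a matrix of floats to a flat bit list.

    A bit is 1 where the value exceeds the median of all values, else 0.
    The median threshold is an assumption, not the rule of the source.
    """
    flat = [v for row in values for v in row]
    threshold = median(flat)
    return [1 if v > threshold else 0 for v in flat]
```

A median threshold yields roughly as many ones as zeros, which keeps the bit sequence informative for the later distance comparison.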

[0036] After the hash sequences are obtained, images are matched with the Hamming distance formula 5. Let h1 and h2 be two hash sequences and L the length of the hash sequences; then

Figure CN102629261AD00092

The Hamming distance between the hash sequence of the phishing page image and the hash sequence of each of the K page images is calculated, and the L pages with the smallest distances are selected as the target pages of the phishing page, K ≥ L ≥ 1.

[0037] The generation and matching part of the image perceptual hash sequence requires the following specific steps:

Step 31) normalize the images: convert every image to a 255-level grayscale image and use bilinear interpolation to unify the resolution to m*m, where m is an integer multiple of 8;

Step 32) divide the m*m image into 8*8 blocks;

Step 33) apply the discrete cosine transform to each block; for each block, keep 1 DC component and 9 AC components and set the rest to 0;

Step 34) process the newly generated discrete cosine coefficient matrix with a visual model to remove redundant data and improve the efficiency of image compression;

Step 35) encrypt using the Logistic equation as a chaotic sequence generator: generate an encryption matrix from a key and encrypt the discrete cosine transform coefficient matrix with this matrix;

Step 36) convert the resulting floating-point data to binary data by quantization, reducing redundancy;

Step 37) apply Huffman coding to obtain the final hash sequence;

Step 38) calculate the Hamming distance between the hash sequence of the phishing page image and the hash sequence of each of the K candidate page images, and select the L pages with the smallest distances as the legitimate pages imitated by the phishing page.
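Step 38) can be sketched as follows, assuming each hash sequence is held as an equal-length list of bits and the candidates are supplied as a mapping from URL to hash. The unnormalized count of differing positions is the standard Hamming distance and may differ by a constant factor from formula 5. The names `hamming_distance` and `closest_pages` and the example URLs are hypothetical.

```python
def hamming_distance(h1, h2):
    """Number of positions at which two equal-length bit sequences differ."""
    if len(h1) != len(h2):
        raise ValueError("hash sequences must have the same length")
    return sum(a != b for a, b in zip(h1, h2))


def closest_pages(phish_hash, candidate_hashes, top_l=1):
    """Return the `top_l` candidate URLs nearest to `phish_hash`.

    `candidate_hashes` maps each candidate URL to its hash sequence; ties are
    broken by URL so that the result is deterministic.
    """
    ranked = sorted(
        candidate_hashes,
        key=lambda url: (hamming_distance(phish_hash, candidate_hashes[url]), url),
    )
    return ranked[:top_l]
```

For example, with a phishing hash `[1, 0, 1, 1, 0, 0, 1, 0]` and a candidate hash that differs in a single bit, that candidate is returned first.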

Claims (1)

1. A method for finding the target page from a phishing web page, characterized in that keywords are first extracted from the page title, the page body, and the page images to form the lexical signature of the phishing page; the lexical signature is then searched on several search engines and the results of these search engines are combined to find the K closest web pages, K being an integer; these K web pages and the phishing page are saved as images and an image perceptual hash sequence is extracted from each; finally, the Hamming distance between each of the K page images and the phishing page image is calculated, and one or more target pages of the phishing page are selected according to the distances;

the method mainly comprises a lexical signature generation part, a multiple search engine retrieval part, and an image perceptual hash sequence generation and matching part;

the lexical signature generation part requires the following steps:

Step 11) extract the plain text from the page title and the page body separately;

Step 12) obtain the images in the page and extract the text embedded in them by optical character recognition (OCR);

Step 13) combine the text from the page title, body, and images, calculate the term frequency-inverse document frequency (TF-IDF) value of each word, and form a lexical signature from the 5 words with the highest TF-IDF values;

the multiple search engine retrieval part requires the following steps:

Step 21) search for the generated lexical signature on each of N search engines, N being an integer;

Step 22) find the web pages that appear in the results of at least two search engines and form a web page list;

Step 23) calculate the relevance of each web page in the list by formulas 1, 2, and 3;

$$\begin{bmatrix} u_{1,1} & u_{2,1} & \cdots & u_{N,1} \\ u_{1,2} & u_{2,2} & \cdots & u_{N,2} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1,N_r} & u_{2,N_r} & \cdots & u_{N,N_r} \end{bmatrix} \qquad (1)$$

where $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, N_r$, and $N$ and $N_r$ are integers;

$$w_{i,j} = \begin{cases} \dfrac{N_r - (R_{i,j} - 1)}{N_r}, & u_{i,j} = u_p \\ 0, & \text{else} \end{cases} \qquad (2)$$

where $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $N_r$ is the total number of search results taken from each search engine; $R_{i,j}$ is the rank of the $j$-th result in the $i$-th search engine, $R_{i,j} = j$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, and if $u_{i,j}$ appears in only one search engine then $w_{i,j} = 0$; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, $M$ being an integer and $M < N \cdot N_r$;

$$S_p = \sum_{i=1}^{N} \sum_{\substack{j=1 \\ u_{i,j} = u_p}}^{N_r} w_{i,j}, \qquad p = 1, 2, \ldots, M \qquad (3)$$

where $S_p$ is the sum of the relevance of $u_p$ over the $N$ search engines; $u_p$ is a URL that appears in the results of at least two search engines, $p = 1, 2, \ldots, M$, $M$ being an integer and $M < N \cdot N_r$; $u_{i,j}$ is the URL ranked $j$ in the results of the $i$-th search engine, and if $u_{i,j}$ appears in only one search engine then $w_{i,j} = 0$; $w_{i,j}$ is the relevance of the result ranked $j$ in the $i$-th search engine; $N$ and $N_r$ are integers;

Step 24) determine the K web pages with the highest relevance from formulas 3 and 4; these K pages are regarded as the most relevant to the phishing page and serve as its candidate target pages, K being an integer not greater than $N \cdot N_r$;
Figure CN102629261AC00031
(4) 其中,Mp表示\在N个搜索引擎中的排名之和;up表示至少出现在两个搜索引擎结果中的网页,P = 1,2,.......,M,M <N*Nr ;uy表示第i个搜索引擎的搜索结果中排名为j的网址,\,.表示第i个搜索引擎中的第j个结果的排名为hRij=h 图像感知哈希序列的生成及匹配部分需要的步骤如下: 步骤31)对图片进行规格化处理,将图片统一变为具有255阶的灰度图像, 并用双线性插值的方法将分辨率统一变为m*m,m为8的整数倍; 步骤32)将m*m的图片分成8*8的小块; 步骤33)对每一小块进行离散余弦变换,对于每一小块,保留I个直流分量, 9个交流分量,其余的将其置为O ; 步骤34)用视觉模型对新生成的离散余弦系数矩阵进行处理,去掉信息中的冗余数据,来提高图像压缩的效率; 步骤35)用逻辑斯谪Logistic方程作为混沌序列发生器进行加密,由一个密钥生成一个加密矩阵,用此矩阵对 (4) where, Mp represents \ N in the search engine rankings and; represents up occurs in at least two search engine results pages, P = 1,2, ......., M, M <N * Nr; uy represents the i-th search results of a search engine to rank j URL \ i ,. represents ranked search engine results for the j-th generation hash hRij = h perceive the image sequence and the step of matching section is the following: step 31) is normalized on the image, the image having a gray scale image 255 becomes uniform order, and bilinear interpolation method for the resolution becomes uniform m * m, m is an integer multiple of 8; step 32) of m * m image into small 8 * 8; step 33) for discrete cosine transform for each tile, for each tile, a DC component I reserved, exchange 9 component, which is the remaining set is O; step 34) for generating discrete cosine coefficient matrix is ​​treated with the new vision model, remove redundant data information, to improve image compression efficiency; step 35) using logistic banished Logistic a chaotic sequence generator equation is encrypted, a cryptographic key is generated by a matrix, this matrix with 散余弦变换系数矩阵进行加密; 步骤36)将得到的浮点型数据通过量化处理变为二值数据,减少冗余; 步骤37)用哈夫曼压缩编码进行压缩编码,得到最终的哈希序列; 步骤38)分别计算钓鱼网页图片的哈希序列和这K个候选网页图片的哈希序列之间的海明距离,选择距离最小的前L个网页为该钓鱼网页模仿的合法网页,L为不大于K的整数。 Discrete cosine transform coefficient matrix is ​​encrypted; step 36) the floating-point data obtained through the quantization is converted into the binary data, reduce redundancy; step 37) using Huffman coding compression coding, the resulting final hash sequence ; step 38) calculates the Hamming distance 
between the fishing hash page image sequence and the sequence of the K candidates hash page images, select the smallest distance L before legitimate web pages for phishing scams imitate, as L K is an integer not greater than.
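Steps 11 to 13 of the claim can be illustrated with a minimal Python sketch. It assumes the page text (title, body and OCR output) has already been tokenized, and that a small background corpus is available for document frequencies. The function name `lexical_signature`, the corpus representation and the smoothed IDF formula are assumptions made for illustration; the claim does not specify how TF-IDF is computed.

```python
import math
from collections import Counter


def lexical_signature(page_tokens, corpus, size=5):
    """Return the `size` page terms with the highest TF-IDF score.

    page_tokens: list of words from the title, body and OCR text (steps 11-12).
    corpus: list of sets of words, one set per background document (assumed).
    """
    tf = Counter(page_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: how many background documents contain the term.
        df = sum(1 for doc in corpus if term in doc)
        # Smoothed IDF (an assumption; the claim only says "TF-IDF").
        idf = math.log((1 + n_docs) / (1 + df))
        scores[term] = (count / len(page_tokens)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]
```

Words that occur in every background document receive a score of zero under this IDF, so common function words drop out of the signature without an explicit stop-word list.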
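Steps 22 to 24 of the claim can be sketched as follows. The sketch assumes each search engine returns a ranked list of URLs with no URL repeated within one list, and that the relevance weight of formula 2 is (N_r - (R - 1)) / N_r. The claim says only that formulas 3 and 4 together determine the top K pages; sorting by the relevance sum S_p in descending order and using the rank sum M_p as a tie-breaker is an assumption, as is the function name `candidate_pages`.

```python
from collections import defaultdict


def candidate_pages(results, k):
    """Pick the k most relevant URLs from several ranked result lists.

    results: one list of URLs per search engine, best result first.
    """
    n_r = max(len(ranked) for ranked in results)  # results taken per engine
    hits = defaultdict(list)  # url -> list of 1-based ranks R_ij
    for ranked in results:
        for rank, url in enumerate(ranked, start=1):
            hits[url].append(rank)

    scored = []
    for url, ranks in hits.items():
        if len(ranks) < 2:
            # Step 22: URLs seen by only one engine get weight 0 and are dropped.
            continue
        s_p = sum((n_r - (r - 1)) / n_r for r in ranks)  # formulas 2 and 3
        m_p = sum(ranks)                                  # formula 4
        scored.append((url, s_p, m_p))

    # Assumed ordering: larger S_p first, then smaller M_p, then URL.
    scored.sort(key=lambda item: (-item[1], item[2], item[0]))
    return [url for url, _, _ in scored[:k]]
```

For example, with three engines returning `["a", "b", "c"]`, `["b", "a", "d"]` and `["c", "b", "e"]`, page `b` appears in all three lists and ranks first, while `d` and `e` are discarded because only one engine returned them.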
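The core of steps 32, 33, 36 and 38 can be illustrated with a simplified, pure-Python sketch. It deliberately omits the visual model (step 34), the Logistic-map encryption (step 35) and the Huffman coding (step 37), so it is not the claimed hash itself. It further assumes that the 9 retained AC components are the first nine in the usual zigzag order and that quantization keeps only the sign of each AC coefficient; neither choice is stated in the claim. The input is assumed to be an already normalized m*m grayscale image given as a list of rows.

```python
import math

# First 10 positions of the 8x8 zigzag scan: the DC term plus 9 AC terms.
ZIGZAG10 = [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1),
            (0, 2), (0, 3), (1, 2), (2, 1), (3, 0)]


def dct_coeff(block, u, v):
    """Single 2-D DCT-II coefficient (u, v) of an 8x8 block."""
    cu = math.sqrt(0.5) if u == 0 else 1.0
    cv = math.sqrt(0.5) if v == 0 else 1.0
    total = 0.0
    for x in range(8):
        for y in range(8):
            total += (block[x][y]
                      * math.cos((2 * x + 1) * u * math.pi / 16)
                      * math.cos((2 * y + 1) * v * math.pi / 16))
    return 0.25 * cu * cv * total


def perceptual_hash(image):
    """Return a list of bits for an m x m grayscale image, m a multiple of 8."""
    m = len(image)
    bits = []
    for bx in range(0, m, 8):
        for by in range(0, m, 8):
            block = [row[by:by + 8] for row in image[bx:bx + 8]]
            for u, v in ZIGZAG10[1:]:  # the 9 retained AC terms
                # Sign-based binarization (assumed); rounding suppresses
                # floating-point noise around zero.
                bits.append(1 if round(dct_coeff(block, u, v), 6) > 0 else 0)
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    if len(a) != len(b):
        raise ValueError("hash sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))
```

Under these assumptions each 8*8 block contributes 9 bits, and the candidate pages would be ranked by `hamming(perceptual_hash(phish), perceptual_hash(candidate))`, smallest first.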
CN 201210051171 2012-03-01 2012-03-01 Method for finding landing page from phishing page CN102629261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210051171 CN102629261B (en) 2012-03-01 2012-03-01 Method for finding landing page from phishing page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210051171 CN102629261B (en) 2012-03-01 2012-03-01 Method for finding landing page from phishing page

Publications (2)

Publication Number Publication Date
CN102629261A CN102629261A (en) 2012-08-08
CN102629261B CN102629261B (en) 2014-07-16

Family

ID=46587521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210051171 CN102629261B (en) 2012-03-01 2012-03-01 Method for finding landing page from phishing page

Country Status (1)

Country Link
CN (1) CN102629261B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004068288A2 (en) * 2003-01-24 2004-08-12 America Online Inc. Classifier Tuning Based On Data Similarities
US20050060643A1 (en) * 2003-08-25 2005-03-17 Miavia, Inc. Document similarity detection and classification system
CN102098235A (en) * 2011-01-18 2011-06-15 南京邮电大学 Phishing mail inspection method based on text characteristic analysis
CN102096781A (en) * 2011-01-18 2011-06-15 南京邮电大学 Phishing detection method based on webpage relevance

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729354A (en) * 2012-10-10 2014-04-16 腾讯科技(深圳)有限公司 Webpage information processing method and device
CN103729354B (en) * 2012-10-10 2015-10-21 腾讯科技(深圳)有限公司 Web page information processing method and apparatus
CN103246701A (en) * 2013-04-02 2013-08-14 百度在线网络技术(北京)有限公司 Quick search method, system and device
CN103412960A (en) * 2013-08-31 2013-11-27 西安电子科技大学 Image perceptual hashing method based on two-sided random projection
CN103412960B (en) * 2013-08-31 2016-08-10 西安电子科技大学 Image perceptual hashing method based on two-sided random projection
CN104079559A (en) * 2014-06-05 2014-10-01 腾讯科技(深圳)有限公司 Web address security detecting method and device and server
CN104079559B (en) * 2014-06-05 2017-07-25 腾讯科技(深圳)有限公司 A URL-security detection method, apparatus and server
CN104717072A (en) * 2015-03-10 2015-06-17 南京师范大学 Remote-sensing image authentication method based on perceptual hash and elliptic curve

Also Published As

Publication number Publication date Type
CN102629261B (en) 2014-07-16 grant

Similar Documents

Publication Publication Date Title
Hoffart et al. Robust disambiguation of named entities in text
Egozi et al. Concept-based information retrieval using explicit semantic analysis
Chum et al. Total recall: Automatic query expansion with a generative feature model for object retrieval
Fu et al. Detecting phishing web pages with visual similarity assessment based on earth mover's distance (EMD)
Yang et al. Supervised reranking for web image search
Turcot et al. Better matching with fewer features: The selection of useful features in large database recognition problems
US8429173B1 (en) Method, system, and computer readable medium for identifying result images based on an image query
US20100241647A1 (en) Context-Aware Query Recommendations
Xu et al. Tag refinement by regularized LDA
Yu et al. Click prediction for web image reranking using multimodal sparse coding
Yu et al. Learning to rank using user clicks and visual features for image retrieval
Pereira et al. Using web information for author name disambiguation
US20100125568A1 (en) Dynamic feature weighting
Lienhart et al. Multilayer pLSA for multimodal image retrieval
Wenyin et al. Detection of phishing webpages based on visual similarity
US20100185691A1 (en) Scalable semi-structured named entity detection
US20110219012A1 (en) Learning Element Weighting for Similarity Measures
US20120158621A1 (en) Structured cross-lingual relevance feedback for enhancing search results
Li et al. Video search in concept subspace: a text-like paradigm
CN101290626A (en) Text categorization feature selection and weight computation method based on field knowledge
US20100074528A1 (en) Coherent phrase model for efficient image near-duplicate retrieval
US20110093452A1 (en) Automatic comparative analysis
US20090327264A1 (en) Topics in Relevance Ranking Model for Web Search
Martinez-Romo et al. Web spam identification through language model analysis
CN101944099A (en) Method for automatically classifying text documents by utilizing body

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
EC01