WO2016033907A1 - Statistical machine learning-based internet hidden link detection method - Google Patents

Statistical machine learning-based internet hidden link detection method

Info

Publication number
WO2016033907A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
webpage
text
classification model
dark
Prior art date
Application number
PCT/CN2014/095168
Other languages
French (fr)
Chinese (zh)
Inventor
孟池洁
王伟
耿光刚
隋鹏宇
Original Assignee
中国科学院计算机网络信息中心
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院计算机网络信息中心
Publication of WO2016033907A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A statistical machine learning-based hidden link detection method comprises the following steps: 1) collecting real webpage source code data as a training set for a classification model, and dividing the data into a category containing hidden links and a category containing no hidden links; 2) extracting anchor texts, i.e., the text content of link fields, from the HTML source code files of all the collected webpages of the two categories, and then segmenting the anchor texts into individual words; 3) vectorizing the two categories of word-segmented texts; 4) reducing the dimension of the vector corresponding to each text; 5) training a classifier on the two categories of data obtained in step 4) to obtain a classification model; and 6) applying the obtained classification model to an unknown webpage to be detected, to obtain a hidden link detection result. Whether a webpage contains hidden links is detected effectively and automatically from the source code of the webpage, so that theoretical and practical support can be provided for search engines to crack down on web cheating.

Description

Internet hidden link detection method based on statistical machine learning
Technical field
The invention belongs to the fields of network technology and search technology, and specifically relates to an Internet hidden link detection method based on statistical machine learning.
Background
As an important gateway to the Internet, search engines have become an indispensable daily tool for Internet users, and the ranking of search results largely determines how those results are presented. Search engines use dedicated algorithms (such as Google's PageRank) to measure the relative importance of web pages and rank search results accordingly. Because search engines use crawlers to fetch web content by following the links between pages, inbound external links are an important factor in most page-importance algorithms: the more links from external websites point to a target page, the higher its weight and the more easily it reaches a top position in the search results. A high ranking can bring a website considerable attention, so many webmasters exchange friendly links with related sites when building their own. Some of them, however, cheat with black-hat and grey-hat techniques (known as black hat SEO), and planting hidden links in websites is one such technique.
A hidden link (暗链, literally "dark chain"), also called a black link, is a link that is written into a web page but styled to be invisible to the human eye. Its purpose is to be picked up by search engine crawlers; it is not shown to readers in the browser and can be found only by inspecting the page source. Hidden link creators exploit the importance that page-weighting algorithms attach to links: they write large numbers of hidden links into web pages, pointing at the pages whose weight they want to raise. Those who use hidden links typically either gain illegal access to other people's websites and write large numbers of unrelated hidden links into them, or are webmasters who take part in hidden link exchanges and write large numbers of such links into their own pages. Because they are concealed, hidden links are hard to discover, and because the underground web-cheating industry, lured by high profits, keeps planting them across the Internet in bulk, they are also hard to eliminate completely. Hidden links resemble the small advertisements pasted on utility poles in the real world and are known as "network psoriasis". This form of cheating not only seriously damages a website's image and reputation but also undermines fair search engine ranking and degrades the quality of search results. Detecting hidden links is therefore necessary.
Although search engines keep penalizing black hat SEO, many hidden links remain on the Internet. The major search engines have not published the specific algorithms or methods they use to discover web cheating. Most current detection is done by webmasters themselves, who inspect the page source for unfamiliar code or use tools to check whether, for example, the site's modification times are abnormal. These methods do little to eradicate hidden links and demand considerable expertise from the person doing the checking; they cannot support automatic, large-scale detection. An existing published Baidu patent on hidden link detection (patent No. 201210049496.2, publication No. CN102622435 A) is a rule-based method: it identifies hiding techniques and combines them with blacklists and whitelists to decide whether hidden links are present. That method is weak at recognizing one of the hiding techniques that hidden links use (invisibility defined in JavaScript), which currently accounts for a large share of hidden links, and it cannot adapt automatically to new hiding techniques, so some hidden links go undetected.
Summary of the invention
In view of the limitations of the prior art, the invention provides a new Internet hidden link detection method that uses a web page's source code to detect effectively and automatically whether the page contains hidden links, providing theoretical and practical support for search engines combating web cheating.
The invention trains a model on features of web page content, using pages labeled as containing or not containing hidden links, and then classifies the pages to be detected into those two classes. Machine learning methods are widely used in text classification, spam filtering, anomaly detection and other fields, and have proven effective. The method allows the classification model to be mined automatically and optimized dynamically, and is a heuristic method.
Specifically, the technical solution adopted by the invention is as follows:
A hidden link detection method based on statistical machine learning, comprising the following steps:
1) Collect real web page source code as the training set for the classification model, and divide it into two classes: pages containing hidden links and pages containing none;
2) Preprocessing: extract the anchor text, that is, the text content of the link fields, from the HTML source files of all pages of both classes collected in step 1), and then split the anchor text into individual words;
3) Vectorize the data obtained in step 2), that is, the two classes of segmented text;
4) Reduce the dimensionality of the vector corresponding to each text. Step 3) yields a vector for each text, but its dimensionality is very high and not every dimension is meaningful, so dimensionality reduction (that is, feature selection) is needed to keep model training efficient;
5) Train a classifier on the two classes of data obtained in step 4) to obtain a classification model;
6) Apply the classification model obtained in step 5) to unknown web pages to be detected, obtaining the hidden link detection result.
Further, step 1) classifies the web pages by expert annotation.
Further, in step 2), for Chinese web pages an open-source word segmenter (such as the Paoding Chinese segmenter or Mmseg) is used to split the anchor text into individual words; for English web pages no dedicated segmenter is needed, and individual words are obtained through word splitting and word filtering alone.
Further, steps 3) to 5) are implemented with open-source machine learning and data mining tools such as Weka, Scikit, and Orange.
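The preprocessing of step 2) for an English page can be sketched as follows. This is a minimal illustration under stated assumptions, not the tooling used in the patent: it relies only on Python's standard `html.parser` module, and the class name `AnchorTextExtractor`, the function `extract_anchor_words`, the stop word list, and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
import re


class AnchorTextExtractor(HTMLParser):
    """Collects the text found inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside an <a> element
        self._buffer = []    # text pieces of the anchor being read
        self.anchors = []    # finished anchor texts

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace inside the anchor text.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.anchors.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


# Hypothetical stop word list; the source only says that words are filtered.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "here"}


def extract_anchor_words(html):
    """Return the lowercased, filtered anchor-text words of an English page."""
    parser = AnchorTextExtractor()
    parser.feed(html)
    parser.close()
    words = []
    for text in parser.anchors:
        for word in re.findall(r"[A-Za-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                words.append(word)
    return words


sample = (
    '<p>News <a href="/a">Latest <b>sports</b> news</a></p>'
    '<div style="display:none"><a href="http://x.example">cheap pills</a></div>'
)
print(extract_anchor_words(sample))
```

Note that the extractor ignores how a link is hidden (here, `display:none`) and keeps only the anchor words, which matches the method's reliance on content rather than on hiding techniques. A Chinese page would need a word segmenter in place of the regular expression.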
The invention proposes a classification method that uses the anchor text in web page source code as the training set. Before the classification model is trained, the anchor text is converted into vectors and feature selection is applied to reduce their dimensionality; a machine learning classification algorithm then trains the model, which can be used to classify unknown web pages automatically in batches and detect whether they contain hidden links.
Compared with the prior art, the invention has the following beneficial effects:
1) A classification model can be trained on an expert-annotated data set, and unknown web pages fed to the model are automatically classified as containing or not containing hidden links. No staff need to be trained in hidden-link-specific knowledge.
2) Detection relies on content features of the page source rather than on the techniques used to hide the links, so the method adapts dynamically and continues to detect effectively when new hiding techniques appear.
Brief description of the drawings
Figure 1 is the overall flow chart of the method of the invention.
Figure 2 is the flow chart of data preparation and preprocessing of the invention.
Figure 3 is the flow chart of classification model training of the invention.
Detailed description
To make the above objects, features and advantages of the invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawings.
Figure 1 is the overall flow chart of the hidden link detection method based on statistical machine learning. It comprises data preparation and preprocessing (collecting and labeling web page source samples, extracting anchor text, word segmentation, and vectorization), training the classification model, and applying the model to unknown web pages to be detected.
Figure 2 shows the data preparation and preprocessing flow of the invention. The steps are as follows:
1) Collect HTML source files that contain hidden links and HTML source files that contain none. The former are selected by manual screening; the latter are the HTML source of the home pages of the various sites listed in the DMOZ directory (an open directory maintained jointly by volunteers worldwide, and the most important website directory on the Internet). Both kinds of HTML text can be obtained by using a crawler to fetch site home pages in bulk.
2) Extract the anchor text from the source file samples of each class and split it into individual words. For Chinese pages this involves a Chinese word segmentation tool (such as the Paoding segmenter or mmseg). To discard meaningless words and retain important ones during segmentation, a stop word list (meaningless function words, pronouns, measure words and so on) and a custom dictionary (words specific to hidden link anchor text) are added to the segmenter.
3) Convert the segmented anchor text of both classes into the data format required by Weka.
4) Feed the data from the previous step into the open-source machine learning and data mining tool Weka for vectorization: each word is one dimension, whose value is 1 if the text contains the word and 0 otherwise, so that every text is converted into a corresponding vector.
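The binary vectorization described in step 4) can be sketched in a few lines of plain Python. This is an illustration only, assuming already segmented word lists as input; it does not reproduce Weka's data format or API, and the function names and sample pages are hypothetical.

```python
def build_vocabulary(documents):
    """Map every distinct word to a column index (sorted for stability)."""
    vocabulary = sorted({word for words in documents for word in words})
    return {word: index for index, word in enumerate(vocabulary)}


def vectorize(words, vocabulary):
    """Binary vector: 1 if the text contains the word, otherwise 0."""
    vector = [0] * len(vocabulary)
    for word in words:
        index = vocabulary.get(word)
        if index is not None:   # words unseen in training are ignored
            vector[index] = 1
    return vector


# Hypothetical segmented anchor texts, one word list per page.
pages = [
    ["casino", "bonus", "casino"],
    ["news", "sports"],
]
vocab = build_vocabulary(pages)
vectors = [vectorize(words, vocab) for words in pages]
print(vocab)
print(vectors)
```

Repeated words do not change the vector, since each dimension records presence rather than a count.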
Figure 3 shows the classification model training flow of the invention. The steps are as follows:
1) To keep model training efficient, Weka's feature selection function is used to reduce the dimensionality of each text's vector: every dimension is evaluated for how strongly it influences the class. Weka offers several evaluation algorithms for feature selection; one that gives good classification results can be chosen, such as the information gain method or the chi-square test shown in Figure 2.
The chi-square test serves as an example of feature selection in text classification. Count the total number of documents N in the sample set. For each word, count A, the frequency with which it appears in positive documents; B, the frequency with which it appears in negative documents; C, the frequency with which it does not appear in positive documents; and D, the frequency with which it does not appear in negative documents. Then compute the chi-square value of each word with the following formula:
Figure PCTCN2014095168-appb-000001
Sort the words by chi-square value in descending order and keep the top K as features, reducing the vectors to K dimensions.
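The formula itself survives here only as an image reference, so the sketch below assumes the standard chi-square statistic for a 2x2 contingency table, χ² = N(AD − BC)² / ((A + C)(B + D)(A + B)(C + D)), with A, B, C, D taken as document counts as defined above. Whether the patent's formula is identical cannot be confirmed from this text. The function names and the sample data are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of one word from its 2x2 document counts.

    a: positive documents containing the word
    b: negative documents containing the word
    c: positive documents not containing the word
    d: negative documents not containing the word
    Assumes the standard 2x2 formula; the patent's formula image is not shown.
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0   # word appears in every document or in none
    return n * (a * d - b * c) ** 2 / denominator


def select_top_k(vectors, labels, k):
    """Return the indices of the k columns with the highest chi-square value."""
    positives = sum(labels)
    negatives = len(labels) - positives
    scores = []
    for column in range(len(vectors[0])):
        a = sum(v[column] for v, y in zip(vectors, labels) if y == 1)
        b = sum(v[column] for v, y in zip(vectors, labels) if y == 0)
        score = chi_square(a, b, positives - a, negatives - b)
        scores.append((score, column))
    # Highest score first; ties are broken by column index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [column for _, column in scores[:k]]


# Hypothetical binary vectors; label 1 marks pages containing hidden links.
vectors = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
labels = [1, 1, 0, 0]
print(select_top_k(vectors, labels, 2))
```

In this toy data the first and third words separate the two classes perfectly, while the second word appears equally often in both and scores zero.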
2) On the reduced vectors from the previous step, train a classification model with the classification algorithms provided by Weka. Several classification methods can be used, such as the [Figure PCTCN2014095168-appb-000002] Bayes, SVM, SMO, and AdaBoost methods shown in Figure 2; the classifier with the best metrics for the data set is chosen according to the performance of the training results. The source names the AdaBoost algorithm as its example of training a classifier, although the procedure it then describes is that of a Bayes classifier: let an item to be classified be x = {a1, a2, ..., am}, where each a is a feature attribute of x, and let the classes be C1, C2, ..., Cn. Compute the frequency of each class in the training samples and the conditional probability estimate of each feature attribute given each class (using the formula P(Ci|x) = P(x|Ci)P(Ci)/P(x)), and record the results.
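The training procedure described above (class frequencies plus per-feature conditional probabilities, combined through P(Ci|x) = P(x|Ci)P(Ci)/P(x)) can be sketched as a small Bernoulli naive Bayes classifier over binary vectors. This is an illustrative sketch, not Weka's implementation: the independence assumption between features, the Laplace smoothing, the function names, and the sample data are all assumptions added here.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate class priors and per-feature conditional probabilities."""
    model = {}
    for label in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        prior = len(rows) / len(vectors)   # frequency of the class
        # P(a_j = 1 | class); Laplace smoothing (an assumption made here)
        # avoids zero probabilities for words unseen in a class.
        p_one = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(len(vectors[0]))
        ]
        model[label] = (prior, p_one)
    return model


def predict(model, vector):
    """Return the class with the highest posterior for a binary vector."""
    best_label, best_score = None, -math.inf
    for label, (prior, p_one) in model.items():
        # log P(C) + sum of log P(a_j | C). P(x) is the same for every
        # class, so it can be dropped when comparing classes.
        score = math.log(prior)
        for value, p in zip(vector, p_one):
            score += math.log(p if value == 1 else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical reduced vectors; label 1 marks pages containing hidden links.
train_vectors = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
train_labels = [1, 1, 0, 0]
model = train_naive_bayes(train_vectors, train_labels)
print(predict(model, [1, 0, 0]), predict(model, [0, 0, 1]))
```

Working in log space avoids numeric underflow when the vectors have many dimensions.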
The trained model is then used to classify unknown web pages. The steps are as follows:
1) Feed the domain names of the web pages to be detected into the crawler, fetch the HTML source of those pages in bulk, and store it as files.
2) Preprocess the source obtained in step 1) in the same way as the data preprocessing above: anchor text extraction, word segmentation, and vectorization.
3) Classify the test set obtained in step 2) with the trained classification model. The trained model can classify unknown web pages automatically in batches to detect whether they contain hidden links.
The three stages of vectorization, feature selection, and classification model training need not depend on existing integrated tools such as the Weka, Scikit, and Orange packages mentioned above; they can also be implemented with purpose-written programs. The open-source tools were used here to simplify the work and shorten the development cycle.
Table 1 lists the precision and recall obtained with five classifiers and four feature extraction algorithms using the method of the invention. The data set consists of Chinese web pages (manually screened Chinese pages containing hidden links, and normal Chinese pages without hidden links collected from the DMOZ directory). Precision is the precision rate, Recall is the recall rate, F-measure is a single value combining the two, and ROC area is the area under the ROC curve. For all four metrics, the closer the value is to 1, the better the performance. Bold entries mark the comparatively better results.
Table 1. Precision and recall of five classifiers and four feature extraction algorithms
Figure PCTCN2014095168-appb-000003
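For reference, the Precision, Recall, and F-measure indicators reported in Table 1 can be computed from predicted and true labels as sketched below. The sketch assumes that F-measure means the usual F1 score (the harmonic mean of precision and recall), which the text does not state explicitly. The labels are toy values and are not taken from Table 1.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Computes precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy labels: 1 = page contains dark chains, 0 = normal page.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here two of the three pages predicted positive are truly positive, and two of the three truly positive pages are found, so precision and recall are both 2/3.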
The above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Those of ordinary skill in the art may modify the technical solutions of the present invention or replace them with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be defined by the claims.

Claims (8)

  1. A dark chain detection method based on statistical machine learning, comprising the steps of:
    1) collecting real web page source code as the training set for a classification model, and dividing it into two classes: pages containing dark chains and pages without dark chains;
    2) extracting the anchor text from the HTML source files of the two classes of web pages, and segmenting the anchor text into individual words;
    3) vectorizing the two classes of segmented text;
    4) reducing the dimensionality of the vector corresponding to each text, that is, performing feature selection;
    5) training a classifier on the two classes of data obtained in step 4) to obtain a classification model;
    6) applying the classification model obtained in step 5) to unknown web pages to be detected to obtain a dark chain detection result.
  2. The method of claim 1, wherein in step 1) the web pages are divided into the two classes by expert annotation.
  3. The method of claim 1, wherein in step 1) a crawler is used to crawl website home pages in batches to obtain the two classes of HTML text.
  4. The method of claim 1, wherein in step 2), if the data set consists of Chinese web pages, an open-source Chinese word segmenter is used to segment the anchor text into individual words; if it consists of English web pages, individual words are obtained directly by word splitting and word filtering.
  5. The method of claim 4, wherein in step 2), during word segmentation of Chinese web pages, a stop word list and a custom word lexicon are added to the Chinese word segmenter in order to reduce meaningless words and retain important words; the custom word lexicon consists of words specific to dark chain anchor text.
  6. The method of claim 1, wherein steps 3) to 5) are implemented with open-source machine learning and data mining tools, the open-source machine learning and data mining tools including but not limited to Weka, Scikit, and Orange.
  7. The method of claim 6, wherein in the vectorization of step 3), each word is taken as one dimension; if a text contains the word, the corresponding dimension is 1, otherwise it is 0; in this way every text is converted into its corresponding vector.
  8. The method of claim 1, wherein in step 5) the classification model is applied to the unknown web pages to be detected as follows:
    a) entering the domain names of the web pages to be detected into the crawler program, fetching the HTML source code of those pages in batches, and storing it as files;
    b) preprocessing the source code obtained in step a), that is, performing anchor text extraction, word segmentation, and vectorization;
    c) classifying the test set obtained in step b) with the already trained classification model to detect whether dark chains are present.
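The training and classification steps recited in the claims can be illustrated with a small self-contained sketch. Everything specific in it is an assumption made for illustration: the toy training data, the feature selection rule (keeping the words whose document frequency differs most between the two classes), and the choice of a Bernoulli naive Bayes classifier. The claims do not name a particular feature selection algorithm or classifier, so these are stand-ins rather than the claimed techniques.

```python
import math

# Toy segmented anchor texts with labels (1 = dark chains, 0 = normal).
train_docs = [
    (["casino", "bonus", "home"], 1),
    (["casino", "lottery", "home"], 1),
    (["news", "about", "home"], 0),
    (["news", "contact", "home"], 0),
]


def select_features(docs, k):
    """Keeps the k words whose document frequency differs most by class."""
    diff = {}
    for words, label in docs:
        for w in set(words):
            diff[w] = diff.get(w, 0) + (1 if label == 1 else -1)
    return sorted(diff, key=lambda w: (-abs(diff[w]), w))[:k]


def vectorize(words, features):
    """Binary vector: 1 if the feature word occurs in the text, else 0."""
    present = set(words)
    return [1 if w in present else 0 for w in features]


def train_bernoulli_nb(vectors, labels):
    """Fits a Bernoulli naive Bayes model with add-one smoothing."""
    model = {}
    for label in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == label]
        prior = len(rows) / len(vectors)
        # Smoothed estimate of P(word present | class) for each feature.
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(len(vectors[0]))]
        model[label] = (prior, probs)
    return model


def predict(model, vector):
    """Returns the label with the highest log posterior score."""
    def score(label):
        prior, probs = model[label]
        return math.log(prior) + sum(
            math.log(p if x else 1.0 - p) for x, p in zip(vector, probs))
    return max(model, key=score)


features = select_features(train_docs, k=2)
vectors = [vectorize(words, features) for words, _ in train_docs]
labels = [label for _, label in train_docs]
model = train_bernoulli_nb(vectors, labels)

print(features)
print(predict(model, vectorize(["casino", "bonus", "shop"], features)))
print(predict(model, vectorize(["news", "weather"], features)))
```

The word "home" appears equally often in both classes, so the selection rule drops it; this is the kind of uninformative dimension that feature selection is meant to remove.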
PCT/CN2014/095168 2014-09-05 2014-12-26 Statistical machine learning-based internet hidden link detection method WO2016033907A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410452221.2A CN104239485B (en) 2014-09-05 2014-09-05 A kind of dark chain detection method in internet based on statistical machine learning
CN201410452221.2 2014-09-05

Publications (1)

Publication Number Publication Date
WO2016033907A1 (en) 2016-03-10

Family

ID=52227544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/095168 WO2016033907A1 (en) 2014-09-05 2014-12-26 Statistical machine learning-based internet hidden link detection method

Country Status (2)

Country Link
CN (1) CN104239485B (en)
WO (1) WO2016033907A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541476A (en) * 2020-12-24 2021-03-23 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN112968875A (en) * 2021-01-29 2021-06-15 上海安恒时代信息技术有限公司 Network relationship construction method and system
CN113965385A (en) * 2021-10-25 2022-01-21 恒安嘉新(北京)科技股份公司 Monitoring processing method, device, equipment and medium for abnormal website
CN115277211A (en) * 2022-07-29 2022-11-01 哈尔滨工业大学(威海) Multi-mode pornography and gambling domain name automatic detection method based on text and images
CN118349756A (en) * 2024-06-17 2024-07-16 江苏省互联网行业管理服务中心 Bad website identification method and system based on source code structure and resource link

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239485B (en) * 2014-09-05 2018-05-01 中国科学院计算机网络信息中心 A kind of dark chain detection method in internet based on statistical machine learning
CN105512285B (en) * 2015-12-07 2018-11-06 南京大学 Adaptive network reptile method based on machine learning
CN107122327B (en) 2016-02-25 2021-06-29 阿里巴巴集团控股有限公司 Method and training system for training model by using training data
CN107016298B (en) * 2017-03-27 2020-07-10 北京神州绿盟信息安全科技股份有限公司 Webpage tampering monitoring method and device
CN107273416B (en) * 2017-05-05 2021-05-04 深信服科技股份有限公司 Webpage hidden link detection method and device and computer readable storage medium
CN107566391B (en) * 2017-09-20 2020-04-14 上海斗象信息科技有限公司 Method for detecting webpage dark chain by constructing machine learning model through domain identification and theme identification
CN107741959A (en) * 2017-09-21 2018-02-27 北京知道未来信息技术有限公司 A kind of pseudo- static URL recognition methods and system based on machine learning
CN109165529B (en) * 2018-08-14 2021-05-07 杭州安恒信息技术股份有限公司 Dark chain tampering detection method and device and computer readable storage medium
CN109213918A (en) * 2018-09-25 2019-01-15 杭州安恒信息技术股份有限公司 The dark chain detection method of webpage and device based on machine learning
CN109522494B (en) * 2018-11-08 2020-09-15 杭州安恒信息技术股份有限公司 Dark chain detection method, device, equipment and computer readable storage medium
CN109617864B (en) * 2018-11-27 2021-04-16 烟台中科网络技术研究所 Website identification method and website identification system
CN109597926A (en) * 2018-12-03 2019-04-09 山东建筑大学 A kind of information acquisition method and system based on social media emergency event
CN109981630B (en) * 2019-03-19 2022-03-29 齐鲁工业大学 Intrusion detection method and system based on chi-square inspection and LDOF algorithm
CN111079042B (en) * 2019-12-03 2023-08-15 杭州安恒信息技术股份有限公司 Webpage hidden chain detection method and device based on text theme
CN112487321A (en) * 2020-12-08 2021-03-12 北京天融信网络安全技术有限公司 Detection method, detection device, storage medium and electronic equipment
CN113810400A (en) * 2021-09-13 2021-12-17 北京百度网讯科技有限公司 Website parasite detection method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8392823B1 (en) * 2003-12-04 2013-03-05 Google Inc. Systems and methods for detecting hidden text and hidden links
CN103679053A (en) * 2013-11-29 2014-03-26 北京奇虎科技有限公司 Webpage tampering detection method and device
CN103856442A (en) * 2012-11-30 2014-06-11 腾讯科技(深圳)有限公司 Black chain detection method, apparatus and system
CN104239485A (en) * 2014-09-05 2014-12-24 中国科学院计算机网络信息中心 Statistical machine learning-based internet hidden link detection method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7743045B2 (en) * 2005-08-10 2010-06-22 Google Inc. Detecting spam related and biased contexts for programmable search engines
CN101350011B (en) * 2007-07-18 2011-09-07 中国科学院自动化研究所 Method for detecting search engine cheat based on small sample set
CN101493819B (en) * 2008-01-24 2011-09-14 中国科学院自动化研究所 Method for optimizing detection of search engine cheat
CN102004764A (en) * 2010-11-04 2011-04-06 中国科学院计算机网络信息中心 Internet bad information detection method and system
CN103150369A (en) * 2013-03-07 2013-06-12 人民搜索网络股份公司 Method and device for identifying cheat web-pages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI, XUAN;: "Research on the Web Classification Based on url Feature", CHINA MASTER'S THESES FULL-TEXT DATABASE, 28 March 2011 (2011-03-28), pages 6,8 and 9 *
XU, ZHENHU;: "Research on Algorithms for Detecting Web Link Spam", CHINA MASTER'S THESES FULL-TEXT DATABASE, 1 July 2012 (2012-07-01), pages 9,16,18 and 19 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541476A (en) * 2020-12-24 2021-03-23 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN112541476B (en) * 2020-12-24 2023-09-29 西安交通大学 Malicious webpage identification method based on semantic feature extraction
CN112968875A (en) * 2021-01-29 2021-06-15 上海安恒时代信息技术有限公司 Network relationship construction method and system
CN113965385A (en) * 2021-10-25 2022-01-21 恒安嘉新(北京)科技股份公司 Monitoring processing method, device, equipment and medium for abnormal website
CN113965385B (en) * 2021-10-25 2024-06-11 恒安嘉新(北京)科技股份公司 Monitoring processing method, device, equipment and medium for abnormal website
CN115277211A (en) * 2022-07-29 2022-11-01 哈尔滨工业大学(威海) Multi-mode pornography and gambling domain name automatic detection method based on text and images
CN115277211B (en) * 2022-07-29 2023-07-28 哈尔滨工业大学(威海) Text and image-based multi-mode pornography and gambling domain name automatic detection method
CN118349756A (en) * 2024-06-17 2024-07-16 江苏省互联网行业管理服务中心 Bad website identification method and system based on source code structure and resource link

Also Published As

Publication number Publication date
CN104239485B (en) 2018-05-01
CN104239485A (en) 2014-12-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901275

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901275

Country of ref document: EP

Kind code of ref document: A1
