CN101035128A - Three-folded webpage text content recognition and filtering method based on the Chinese punctuation - Google Patents

Info

Publication number
CN101035128A
Authority
CN
China
Prior art keywords
text
information
filtering
url
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007100110571A
Other languages
Chinese (zh)
Other versions
CN101035128B (en)
Inventor
宋明秋
吴新涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN2007100110571A priority Critical patent/CN101035128B/en
Publication of CN101035128A publication Critical patent/CN101035128A/en
Application granted granted Critical
Publication of CN101035128B publication Critical patent/CN101035128B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A triple webpage text content recognition and filtering method based on Chinese punctuation marks. To address the low filtering accuracy and low recall of existing URL-based and keyword-based web information filtering methods, the invention proposes a composite method for filtering webpage text content that combines URL-based filtering, keyword-based filtering, and text filtering based on the vector space model of knowledge representation. URL filtering based on black and white lists, together with the statistical characteristics of Chinese punctuation, is used to effectively remove navigation information, related-link information, advertising links, copyright information and other noise from the web page and to extract the body text. The text is represented with the vector space model; the cosine of the angle between the text vector and the feature vectors of the undesirable-content templates is calculated and compared against a set threshold to determine the class of the text. The invention can be widely used for filtering undesirable network information and for personalized website information services.

Description

Triple webpage text content recognition and filtering method based on Chinese punctuation marks
Technical field
The invention belongs to the field of network information security and relates to the recognition and filtering of undesirable text on Chinese web pages.
Background technology
Most existing web content security products, such as "network nurse" and "network father", block access to illegal web pages and websites using URL addresses and keywords. Given the diversity and dynamic nature of illegal content online, an approach that relies on a static address database, or on manually updated addresses and keywords, falls far short of users' filtering needs, and parents are hoping for more effective and comprehensive content filtering products.
Existing methods for filtering webpage text content are mainly built around the vector space model.
Liu Peide et al. used the vector space model, the TC3 classification algorithm, the Rocchio feedback model and other techniques to build a network information filtering system (NIFS) with a feedback mechanism; the system filters text based on a user interest profile.
The information security filtering system based on the vector space model built by Cao Yi and He Weihong divides filtering into two stages: template training and adaptive filtering. In the training stage, an initial filtering template is built through topic processing and feature extraction, and an initial threshold is set. In the filtering stage, the template and threshold are adjusted adaptively according to user feedback. The distinguishing feature of this method lies mainly in the design of the template training algorithm.
In 2002, Shian-Hua Lin and Jan-Ming Ho proposed a method for removing noise content from web pages. The method builds a tag tree of the page from its <table> tags and partitions the page into mutually nested content blocks. Then, for a set of pages generated from the same template, content blocks that occur repeatedly across the set are treated as noise, while blocks that occur rarely in the set are treated as the informative blocks.
Fudan University proposed an Internet filtering system and filtering method based on a content filtering agent (CFA). The system framework has three parts: the content filtering agent (CFA), a query server (QS), and a content analysis and management server (CAMS). The filtering process is as follows: when the user requests access to a URL, the CFA allows or denies the request according to the black and white lists set by the user. If the URL is on neither list, the CFA sends a query to the QS. The QS looks up the rating information for the URL in its own URL database and returns the result to the CFA, which responds accordingly. The QS also periodically downloads updated URL rating information from the CAMS.
And " the information filtering technology that is used for network browsing " of Microsoft provides a kind of user of control could visit the system and method for some internet site when using a computer.When the computer user attempts to visit one during by the internet site of specifying uniform resource locator (URL) to point to, filter is tabulated by permission-prevention and is provided reference to URL, and by reference---the cross reference age group checks that age group allows the categorised content mapping table of watching, and correspondingly determines the visit to the website of URL sensing.
Summing up the prior work, current Internet information filtering methods still have the following shortcomings:
1. Filtering methods that use URLs and keywords have low accuracy and low recall, and the filter is easily bypassed.
2. Content filtering that relies solely on the text vector space is slow and cannot meet the real-time filtering requirements of broadband network data transfer.
3. Little research has addressed webpage preprocessing. In particular, no published work has been seen on general methods for extracting the body content of web pages, even though research on this problem could effectively increase the speed of web data processing.
4. No content recognition and filtering method tailored to the characteristics of Chinese web pages has been reported.
Summary of the invention
Existing web information filtering methods cannot keep up with network traffic in filtering accuracy, recall and speed. To overcome this limitation, the invention provides a triple filtering method that integrates the existing URL-based, keyword-based and vector-space-based text filtering methods. In URL filtering, tables of legal and illegal URLs (that is, white and black lists) are used to speed up filtering. Winsock 2 SPI is used to intercept HTTP packets directly at the application layer, which avoids the packet reassembly and protocol analysis required when intercepting packets at lower layers. A method for recognizing the body text of Chinese web pages and removing noise, based on statistics of Chinese punctuation marks, is proposed.
To achieve the above objectives, the invention adopts the following technical solution:
The system uses a three-stage filtering mode: URL filtering, keyword filtering, and text content filtering.
The system structure is shown in Fig. 1, in which:
URL filtering module
Determines whether the user's request is legal according to a preset illegal URL list (blacklist) and legal URL list (white list).
Content interception and extraction module
First intercepts the response to the suspicious request returned from the server (the HTTP packets), then extracts the HTML document, and finally analyzes the HTML document to extract the link information and body content.
Keyword filtering module
Uses keywords to check the link information for illegal links in the web page; if the page contains any illegal link, the page is blocked.
Content filtering module
The body text of a suspicious page that contains only legal links is segmented into words, stop words are removed, weights are calculated and features are extracted. The text is then represented in the vector space model and matched against the trained feature vectors to judge whether its content is legal.
The operating steps of the system are summarized as follows:
1. When the user sends a connection request, the requested URL is compared with the addresses in the black and white lists and handled accordingly. A requested address that is on neither the blacklist nor the white list is marked as a suspicious request.
2. The response to the suspicious request, that is, the HTTP packets returned by the server, is intercepted. Because Winsock 2 SPI intercepts at the application layer, the packet reassembly and protocol analysis needed when intercepting packets at lower layers are avoided; efficiency is high and CPU usage is low.
3. The HTML file is extracted from the intercepted HTTP packets, link information is extracted from it, and the body text of the page is obtained using the body content recognition method based on Chinese punctuation statistics.
4. The link information is checked with the keyword-based filtering method. If a link is illegal, a warning message is returned; otherwise processing passes to the content filtering module.
5. A text classification corpus of undesirable Chinese web page content is built and used as the training samples for the webpage text content templates. Content filtering is applied to the page body text to check its legitimacy: legal text is returned to the user, illegal text is blocked directly, and the URL lists are updated.
The effects and benefits of the invention are as follows. Winsock 2 SPI functions intercept HTTP packets directly at the application layer, avoiding the packet reassembly and protocol analysis required when intercepting at lower layers. The body text recognition and extraction method based on Chinese punctuation statistics effectively removes noise such as navigation information, related-link information, advertising links and copyright information. The invention can effectively improve the speed, accuracy and precision of web information filtering. It can be used to filter undesirable content on Chinese web pages and can be widely applied to personalized text classification information services.
Description of drawings
Fig. 1 is the overall structure diagram of the webpage text content filtering system based on Chinese punctuation marks.
Fig. 2 is the URL filtering flow chart.
Fig. 3 shows the nested HTML structure of web page information and its knowledge representation as an HTML tree.
Fig. 4 is the content filtering process flow chart.
Embodiment
Specific embodiments of the invention are described in detail below with reference to the technical solution and the drawings.
Step 1
When the user enters a web address in the browser's address bar, or clicks a link in a web page, the filter compares the requested URL with the addresses in the black and white lists (as shown in Fig. 2). A URL request on the white list is let through. A URL request on the blacklist is blocked and a warning message is returned. A URL that is on neither list is marked as a suspicious request, and step 2 is executed.
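The three-way decision in step 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the list contents, the function name `classify_url`, and the choice to match on the host name (rather than the full URL) are all assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical example lists; a real system would load these from storage.
WHITELIST = {"www.example.edu"}
BLACKLIST = {"bad.example.com"}


def classify_url(url, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Return 'allow', 'block', or 'suspicious' for a requested URL."""
    # Assumption: lists hold host names, so only the host is compared.
    host = (urlsplit(url).hostname or "").lower()
    if host in whitelist:
        return "allow"       # on the white list: let the request through
    if host in blacklist:
        return "block"       # on the blacklist: block and warn the user
    return "suspicious"      # on neither list: continue with step 2
```

Only requests classified as suspicious go on to the more expensive interception and content analysis stages.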
Step 2
Winsock 2 SPI is used to intercept the HTTP packets returned by the server in response to the suspicious request.
Step 3
The HTML file is extracted from the HTTP packets intercepted in step 2 and analyzed to extract link information. The HTML tree is also analyzed (as shown in Fig. 3), and the body extraction method based on Chinese punctuation marks is applied to remove noise such as navigation information, related-link information, advertising links and copyright information, yielding the body text of the page.
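The text does not spell out the punctuation statistic, so the following is only one plausible reading of step 3: treat each text node of the HTML as a block, count the full-width Chinese punctuation marks in it, and keep the blocks that reach a minimum count. Navigation bars, link lists and copyright lines tend to contain few such marks. The punctuation set, the `MIN_MARKS` threshold and all names are assumptions.

```python
from html.parser import HTMLParser

# Full-width punctuation typical of Chinese running text (illustrative set).
CHINESE_PUNCTUATION = set("，。、；：？！“”")
MIN_MARKS = 2  # hypothetical threshold, not a value given in the source


class BlockCollector(HTMLParser):
    """Collect link targets and the text blocks of an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.blocks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip:
            self.blocks.append(text)


def extract_links_and_body(html):
    """Return (links, body_text), keeping only punctuation-rich blocks."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    body = [b for b in parser.blocks
            if sum(ch in CHINESE_PUNCTUATION for ch in b) >= MIN_MARKS]
    return parser.links, "\n".join(body)
```

Because each text node is scored on its own, a paragraph interrupted by inline tags is split into several short blocks; a fuller version would aggregate counts over subtrees of the HTML tree.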
Step 4
For the hyperlink information extracted in step 3, pattern matching is used to check whether a link contains illegal keywords. If it does, the link is classed as illegal, the system blocks it and returns a warning message; otherwise step 5 is executed to perform content filtering and judge the legitimacy of the page content.
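A minimal sketch of the keyword check in step 4, using a single compiled regular expression as the pattern-matching method. The keyword list is purely hypothetical (the source does not publish one), as are the function names.

```python
import re

# Hypothetical keyword list for illustration only.
ILLEGAL_KEYWORDS = ["gamble", "casino"]
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in ILLEGAL_KEYWORDS), re.IGNORECASE
)


def find_illegal_links(links):
    """Return the links that contain at least one illegal keyword."""
    return [link for link in links if _PATTERN.search(link)]


def page_passes_keyword_filter(links):
    """True when no link on the page matches an illegal keyword."""
    return not find_illegal_links(links)
```

A page that passes this check is handed on to content filtering; a page with any matching link is blocked.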
Content filtering is the core of the system. Its basic filtering flow is shown in Fig. 4, and the filtering steps are as follows:
Step 5
The suspicious page body text obtained through steps 3 and 4 is segmented into words using a dictionary-based forward maximum matching algorithm.
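Forward maximum matching scans the text from left to right and, at each position, takes the longest dictionary word that starts there, falling back to a single character when nothing matches. The sketch below assumes the dictionary is a plain set of words; the function name and the `max_len` parameter are illustrative.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by taking the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first and shrink until one matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(forward_max_match("研究生命起源", {"研究", "研究生", "生命", "起源"}))
# → ['研究生', '命', '起源']
```

The example output also shows the known weakness of the greedy strategy: the longest match "研究生" wins even though "研究 / 生命 / 起源" would be the better segmentation.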
Step 6
Stop words are removed from the segmentation result according to a stop-word list; that is, words that carry little meaning are removed so that they do not influence the judgment.
Step 7
Use the method for word frequency statistics, carry out the feature speech and extract, promptly extract the speech that more can show file characteristics, to improve program efficiency, the speed of service and nicety of grading.
Step 8
The weights of the feature words are calculated with the TF-IDF formula.
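The source does not say which TF-IDF variant it uses. The sketch below assumes one common smoothed form, weight = tf × (ln((1 + N) / (1 + df)) + 1), where tf is the relative frequency of the word in the document, N is the number of corpus documents and df is the number of documents containing the word. The function name and the corpus representation (a list of word lists) are assumptions.

```python
import math
from collections import Counter


def tfidf_weights(doc_words, corpus):
    """Return {word: weight} for doc_words, given a corpus of word lists."""
    n_docs = len(corpus)
    # Document frequency: in how many corpus documents each word occurs.
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    counts = Counter(doc_words)
    total = len(doc_words)
    weights = {}
    for word, count in counts.items():
        tf = count / total
        # Smoothed IDF (an assumed variant) avoids division by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        weights[word] = tf * idf
    return weights
```

Words that are frequent in the document but rare across the corpus receive the highest weights, which is what makes them useful as discriminating features.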
Step 9
The feature vector of the text is generated, and the cosine of the angle between this vector and the sample vectors in the feature vector library is calculated to obtain the similarity value.
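A minimal sketch of the similarity computation, with vectors stored as sparse `{word: weight}` dictionaries. The cosine formula itself is standard; taking the maximum over all template vectors as the final similarity value is an assumption, since the source does not say how several sample vectors are combined.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors ({term: weight})."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


def max_similarity(doc_vector, templates):
    """Highest cosine similarity against any template (assumed rule)."""
    return max((cosine(doc_vector, t) for t in templates), default=0.0)
```

With non-negative TF-IDF weights the result lies between 0 and 1, so it can be compared directly with a fixed threshold.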
Step 10
The similarity value is compared with the preset threshold, which the invention sets to 0.6-0.8, to determine the nature of the page content. When the similarity is above the threshold, the page is classed as illegal and the system denies access; when the similarity is below the threshold, the text is classed as legal and the system allows access.
Step 11
The legal and illegal URL lists are updated: the URL of a text classed as illegal is added to the blacklist, and the URL of a legal text is added to the white list, so that the same page content is not content-filtered repeatedly and filtering efficiency improves.
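Steps 10 and 11 together amount to a threshold decision followed by a list update, which might look like the sketch below. The threshold value 0.7 is an assumed setting, the lists are modelled as in-memory sets, and a similarity exactly equal to the threshold is treated as legal, a case the text leaves open.

```python
THRESHOLD = 0.7  # assumed value; the source states a range, not one number


def decide_and_update(url, similarity, whitelist, blacklist,
                      threshold=THRESHOLD):
    """Classify a page by similarity and record its URL in the right list."""
    if similarity > threshold:
        blacklist.add(url)   # illegal: deny access and remember the URL
        return "block"
    whitelist.add(url)       # legal: allow access and remember the URL
    return "allow"
```

On the next request for the same URL, the list lookup in step 1 settles the outcome without repeating the content analysis.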
The filtering method above requires sample vector templates in the feature vector library. The templates are obtained by training on the texts in the illegal corpus; the training process is shown in Fig. 4 and proceeds as follows:
1) Build a corpus of undesirable network content.
2) Segment the sample texts in the illegal corpus into Chinese words using the dictionary-based forward maximum matching method.
3) Remove stop words from the segmentation result according to the stop-word list, obtaining a high-dimensional word set.
4) Extract features from this high-dimensional word set using word frequency statistics.
5) Calculate the weights of the feature words with the TF-IDF formula.
6) Generate the vector space model of each document and store it in the feature vector library, producing the sample vector templates.
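The training procedure can be sketched end to end as follows, assuming the corpus documents have already been segmented into word lists (training step 2). The smoothed TF-IDF variant, the `top_k` cut-off and the choice of one template vector per training document are assumptions; the source does not specify them.

```python
import math
from collections import Counter


def train_templates(segmented_docs, stop_words, top_k=50):
    """Build one TF-IDF template vector per segmented training document."""
    # Step 3: remove stop words from every document.
    cleaned = [[w for w in doc if w not in stop_words]
               for doc in segmented_docs]
    n_docs = len(cleaned)
    doc_freq = Counter(w for doc in cleaned for w in set(doc))
    templates = []
    for doc in cleaned:
        vector = {}
        # Step 4: keep the top_k most frequent words as features.
        for word, count in Counter(doc).most_common(top_k):
            # Step 5: weight each feature with (assumed) smoothed TF-IDF.
            tf = count / len(doc)
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            vector[word] = tf * idf
        # Step 6: the vector becomes a template in the feature library.
        templates.append(vector)
    return templates
```

The returned list plays the role of the feature vector library against which incoming page vectors are compared.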

Claims (1)

1. A triple webpage text content recognition and filtering method based on Chinese punctuation marks, providing a triple web information filtering system architecture that combines URL address, keyword and content filtering, characterized in that: Winsock 2 SPI functions are used to intercept HTTP packets directly at the application layer; a general Chinese web page noise removal and body text extraction method based on Chinese punctuation statistics is used; and a text classification corpus of undesirable Chinese web page content is built and used as the training samples for the webpage text content templates.
CN2007100110571A 2007-04-18 2007-04-18 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation Expired - Fee Related CN101035128B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2007100110571A CN101035128B (en) 2007-04-18 2007-04-18 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2007100110571A CN101035128B (en) 2007-04-18 2007-04-18 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation

Publications (2)

Publication Number Publication Date
CN101035128A true CN101035128A (en) 2007-09-12
CN101035128B CN101035128B (en) 2010-04-21

Family

ID=38731427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007100110571A Expired - Fee Related CN101035128B (en) 2007-04-18 2007-04-18 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation

Country Status (1)

Country Link
CN (1) CN101035128B (en)

Cited By (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101855632B (en) * 2007-11-08 2013-10-30 上海惠普有限公司 URL and anchor text analysis for focused crawling
CN102106114A (en) * 2008-05-28 2011-06-22 兹斯卡勒公司 Distributed security provisioning
CN102106114B (en) * 2008-05-28 2014-10-22 兹斯卡勒公司 Distributed security provisioning method and its system
CN101727461A (en) * 2008-10-13 2010-06-09 中国科学院计算技术研究所 Method for extracting content of web page
CN101901314B (en) * 2009-06-19 2013-07-17 卡巴斯基实验室封闭式股份公司 Detection and minimization of false positives in anti-malware processing
CN101901314A (en) * 2009-06-19 2010-12-01 卡巴斯基实验室封闭式股份公司 The detection of wrong report and minimizing during anti-malware is handled
CN101739439B (en) * 2009-11-30 2014-03-12 中兴通讯股份有限公司 Method and system for dynamically customizing statistical object based on template
CN102236654A (en) * 2010-04-26 2011-11-09 广东开普互联信息科技有限公司 Web useless link filtering method based on content relevancy
CN102136973A (en) * 2010-09-08 2011-07-27 乔永清 System and method for monitoring real data of website
CN102411587A (en) * 2010-09-21 2012-04-11 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102411587B (en) * 2010-09-21 2013-08-21 腾讯科技(深圳)有限公司 Webpage classification method and device
CN102469117B (en) * 2010-11-08 2014-11-05 中国移动通信集团广东有限公司 Method and device for identifying abnormal access action
CN102469117A (en) * 2010-11-08 2012-05-23 中国移动通信集团广东有限公司 Method and device for identifying abnormal access action
CN102054030A (en) * 2010-12-17 2011-05-11 惠州Tcl移动通信有限公司 Mobile terminal webpage display control method and device
CN102546576A (en) * 2010-12-31 2012-07-04 北京启明星辰信息技术股份有限公司 Webpagehanging trojan detecting and protecting method and system as well as method for extracting corresponding code
CN102546576B (en) * 2010-12-31 2015-11-18 北京启明星辰信息技术股份有限公司 A kind of web page horse hanging detects and means of defence, system and respective code extracting method
CN102754488A (en) * 2011-04-18 2012-10-24 华为技术有限公司 User access control method, apparatus and system
CN102170640A (en) * 2011-06-01 2011-08-31 南通海韵信息技术服务有限公司 Mode library-based smart mobile phone terminal adverse content website identifying method
CN102929872B (en) * 2011-08-08 2016-04-27 阿里巴巴集团控股有限公司 By computer-implemented information filtering method, message screening Apparatus and system
CN102929872A (en) * 2011-08-08 2013-02-13 阿里巴巴集团控股有限公司 Computer-implemented message filtering method, message filtering device and system
CN102624703B (en) * 2011-12-31 2015-01-21 华为数字技术(成都)有限公司 Method and device for filtering uniform resource locators (URLs)
CN102567534A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Interactive product user generated content intercepting system and intercepting method for the same
US9331981B2 (en) 2011-12-31 2016-05-03 Huawei Technologies Co., Ltd. Method and apparatus for filtering URL
CN102567534B (en) * 2011-12-31 2014-02-19 凤凰在线(北京)信息技术有限公司 Interactive product user generated content intercepting system and intercepting method for the same
CN102624703A (en) * 2011-12-31 2012-08-01 成都市华为赛门铁克科技有限公司 Method and device for filtering uniform resource locators (URLs)
CN102647408A (en) * 2012-02-27 2012-08-22 珠海市君天电子科技有限公司 Method for judging phishing website based on content analysis
CN102622435B (en) * 2012-02-29 2017-12-12 百度在线网络技术(北京)有限公司 A kind of method and apparatus for detecting black chain
CN102622435A (en) * 2012-02-29 2012-08-01 百度在线网络技术(北京)有限公司 Method and device for detecting black chain
CN104462613A (en) * 2012-06-20 2015-03-25 北京奇虎科技有限公司 Hot spot aggregating method and device
CN103581144A (en) * 2012-08-06 2014-02-12 无锡稳捷网络技术有限公司 Network safety access control method based on ICAP
CN102855320A (en) * 2012-09-04 2013-01-02 珠海市君天电子科技有限公司 Method and device for collecting keyword related URL (uniform resource locator) by search engine
CN102902793A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Creation system and method of webpage classification knowledge base
CN102902793B (en) * 2012-09-29 2016-12-21 北京奇虎科技有限公司 Webpage category knowledge base set up system and method
CN103853747A (en) * 2012-11-30 2014-06-11 腾讯科技(深圳)有限公司 Method and device for controlling sound source webpage
CN103853747B (en) * 2012-11-30 2018-09-04 腾讯科技(深圳)有限公司 A kind of control method and device of sound source webpage
CN103092994A (en) * 2013-02-20 2013-05-08 苏州思方信息科技有限公司 Support vector machine (SVM) text automatic sorting method and system based on information concept lattice correction
CN104052722A (en) * 2013-03-15 2014-09-17 腾讯科技(深圳)有限公司 Web address security detection method, apparatus and system
CN104079528A (en) * 2013-03-26 2014-10-01 北大方正集团有限公司 Method and system of safety protection of Web application
CN103605691B (en) * 2013-11-04 2017-04-26 北京奇虎科技有限公司 Device and method used for processing issued contents in social network
CN103778226A (en) * 2014-01-23 2014-05-07 北京奇虎科技有限公司 Method for establishing language information recognition model and language information recognition device
CN105100904A (en) * 2014-05-09 2015-11-25 深圳市快播科技有限公司 Video advertisement blocking method, device and browser
CN105608083A (en) * 2014-11-13 2016-05-25 北京搜狗科技发展有限公司 Method and device for obtaining input library, and electronic equipment
CN105608083B (en) * 2014-11-13 2019-09-03 北京搜狗科技发展有限公司 Obtain the method, apparatus and electronic equipment of input magazine
CN105812417A (en) * 2014-12-29 2016-07-27 国基电子(上海)有限公司 Remote server, router and bad webpage information filtering method
CN105812417B (en) * 2014-12-29 2019-05-03 国基电子(上海)有限公司 Remote server, router and bad webpage information filtering method
CN104951553A (en) * 2015-06-30 2015-09-30 成都蓝码科技发展有限公司 Content collecting and data mining platform accurate in data processing and implementation method thereof
CN106649338A (en) * 2015-10-30 2017-05-10 中国移动通信集团公司 Information filtering policy generation method and apparatus
CN106649338B (en) * 2015-10-30 2020-08-21 中国移动通信集团公司 Information filtering strategy generation method and device
CN105491023A (en) * 2015-11-24 2016-04-13 国网智能电网研究院 Data isolation exchange and security filtering method orienting electric power internet of things
CN105491023B (en) * 2015-11-24 2020-10-27 国网智能电网研究院 Data isolation exchange and safety filtering method for power Internet of things
CN105975395A (en) * 2016-05-30 2016-09-28 深圳市华傲数据技术有限公司 Website state reconnaissance method and device
CN106789980A (en) * 2016-12-07 2017-05-31 北京亚鸿世纪科技发展有限公司 Method and device for monitoring and managing website legitimacy
CN107122350B (en) * 2017-04-27 2021-02-05 北京易麦克科技有限公司 Feature extraction system and method for multi-paragraph text
CN107122350A (en) * 2017-04-27 2017-09-01 北京易麦克科技有限公司 Feature extraction system and method for multi-paragraph text
CN109274632A (en) * 2017-07-12 2019-01-25 中国移动通信集团广东有限公司 Website identification method and device
CN109274632B (en) * 2017-07-12 2021-05-11 中国移动通信集团广东有限公司 Website identification method and device
CN110020075A (en) * 2017-10-20 2019-07-16 南京烽火软件科技有限公司 Automatic illegal website mining device
CN107766551A (en) * 2017-10-31 2018-03-06 广东小天才科技有限公司 Website auditing and controlling method based on big data analysis and terminal equipment
CN107835197A (en) * 2017-12-15 2018-03-23 江苏盖亚建筑工程有限公司 Network transmission system
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 Method and apparatus for Internet web page information filtering
CN109688205A (en) * 2018-12-07 2019-04-26 麒麟合盛网络技术股份有限公司 Webpage resource interception method and device
CN109688205B (en) * 2018-12-07 2021-06-22 麒麟合盛网络技术股份有限公司 Webpage resource interception method and device
CN109743309A (en) * 2018-12-28 2019-05-10 微梦创科网络科技(中国)有限公司 Illegal request identification method and device and electronic equipment
CN109743309B (en) * 2018-12-28 2021-09-10 微梦创科网络科技(中国)有限公司 Illegal request identification method and device and electronic equipment
CN111382061B (en) * 2018-12-29 2024-05-17 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN111382061A (en) * 2018-12-29 2020-07-07 北京搜狗科技发展有限公司 Test method, test device, test medium and electronic equipment
CN110110195A (en) * 2019-05-07 2019-08-09 宜人恒业科技发展(北京)有限公司 Impurity removal method and device
CN110750639A (en) * 2019-07-02 2020-02-04 厦门美域中央信息科技有限公司 Text classification and R language realization based on vector space model
CN111259237A (en) * 2020-01-13 2020-06-09 中国搜索信息科技股份有限公司 Method for identifying public harmful information
CN111741007A (en) * 2020-07-06 2020-10-02 桦蓥(上海)信息科技有限责任公司 Financial business real-time monitoring system and method based on network layer message analysis
CN114024947B (en) * 2022-01-05 2022-04-01 北京微步在线科技有限公司 Web access method and device based on browser
CN114024947A (en) * 2022-01-05 2022-02-08 北京微步在线科技有限公司 Web access method and device based on browser
CN116502009A (en) * 2023-06-25 2023-07-28 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium
CN116502009B (en) * 2023-06-25 2023-10-31 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium
CN117176483A (en) * 2023-11-03 2023-12-05 北京艾瑞数智科技有限公司 Abnormal URL identification method and device and related products

Also Published As

Publication number Publication date
CN101035128B (en) 2010-04-21

Similar Documents

Publication Publication Date Title
CN101035128A (en) Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
CN104125209B (en) Malicious website prompt method and router
CN101655868B (en) Network data mining method, network data transmitting method and equipment
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN103810425B (en) Method and device for detecting malicious web addresses
CN102054028B (en) Method for implementing web-rendering function by using web crawler system
US8812949B2 (en) System and method for automatically detecting and extracting semantically significant text from a HTML document associated with a plurality of HTML documents
CN102779170B (en) System and method for identifying text floor of webpage
CN1912869A (en) Implementing method of network profile
CN1955963A (en) System and method for searching dates in electronic documents
CN104504150A (en) News public opinion monitoring system
CN101909079A (en) User online behavior data acquisition method in backbone link and system
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN103559235A (en) Online social network malicious webpage detection and identification method
CN107577788B (en) E-commerce website topic crawler method for automatically structuring data
CN101075909A (en) Method and system for accounting webstation access information
CN110457579B (en) Webpage denoising method and system based on cooperative work of template and classifier
CN105912684B (en) Cross-media retrieval method based on visual features and semantic features
WO2013189254A1 (en) Hotspot aggregation method and device
CN103064984A (en) Spam webpage identifying method and spam webpage identifying system
CN101071445A (en) Classified sample set optimizing method and content-related advertising server
US11334592B2 (en) Self-orchestrated system for extraction, analysis, and presentation of entity data
KR20090130364A (en) Method, apparatus and computer-readable recording medium for tagging image contained in web page and providing web search service using tagged result
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
CN108183902B (en) Malicious website identification method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100421

Termination date: 20180418