CN102117339A - Filter supervision method specific to unsecure web page texts - Google Patents


Info

Publication number
CN102117339A
Authority
CN
China
Prior art keywords
text
web page
theme
concept
dangerous
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110083908XA
Other languages
Chinese (zh)
Inventor
曹晓晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201110083908XA priority Critical patent/CN102117339A/en
Publication of CN102117339A publication Critical patent/CN102117339A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a filtering and supervision method for unsafe web page text, in the technical field of network and information security. The method comprises the following steps: 1, storing the concept feature vectors and feature values of texts on multiple different themes in a concept library; 2, capturing text data through network nodes and pre-processing it, the pre-processing comprising word segmentation and removal of meaningless words; 3, judging whether the theme of the web page text is a sensitive theme; 4, determining which theme the content belongs to, then counting the concept feature vectors and calculating the feature values by the same method as in step 1; 5, classifying the web page text; and 6, saving the web page address to a supervision database according to the filtering result, after which an administrator places the web page behind a firewall so that all requests to access it are blocked automatically, thereby achieving web page information supervision. The method improves accuracy and increases filtering speed.

Description

Filtering and supervision method for unsafe web page text
Technical field
The present invention relates to a method in the technical field of network and information security, specifically a filtering and supervision method for unsafe web page text.
Background
With the rapid development of the Internet, the information on it has become increasingly diverse; according to Baidu's statistics, the total number of Chinese web pages grew rapidly to 2.4 billion in 2005. At the same time, cheap storage devices have accelerated the growth of information carriers such as text and pictures. Among so much information there are a large number of harmful web pages containing violent, pornographic, or reactionary content. The existence of these web pages is an obstacle both to social stability and to obtaining useful information. How to filter harmful information out of this vast sea of information is therefore a problem in urgent need of a solution.
At present, information filtering on the Internet mostly adopts one of the following methods: rating labels, URL filtering, or keyword filtering. Supervision based on rating labels is effective only if information publishers exercise good self-discipline. URL filtering is effective only if the addresses containing unsafe information are known in advance. Keyword filtering cannot understand the meaning of the text, and its results are usually not very satisfactory. Moreover, within a single theme, filtering performs poorly because the feature vectors are similar. For example, of two articles about Falun Gong, one may set out the harm Falun Gong does to society and be critical in purpose, while the other may promote Falun Gong. Yet the two may have similar vector representations, so filtering by keywords alone will produce misjudgments.
A search of the prior-art literature found "A high-performance two-class Chinese text classification method", published by Fan Xinghua et al. in the Chinese Journal of Computers, issue 1 of 2006, volume 9, page 124. That article proposes classification using two-word strings (bigrams) as features. Its shortcomings are that it does not consider factors such as word distance, and that the feature value calculation it describes is not well suited to two-word phrases.
Summary of the invention
To address the above shortcomings of the prior art, the present invention provides a filtering and supervision method for unsafe web pages that takes the semantics of the text into account by replacing traditional keyword feature vectors with concept feature vectors. This improves the filtering effect and, because the number of feature vectors is reduced, also increases filtering speed.
The present invention is achieved through the following technical solution, which comprises the following steps:
Step 1: for texts on different themes, count the concept feature vectors, calculate the feature values, and store the concept feature vectors and feature values of the texts on the multiple different themes in a concept library;
Storing the concept feature vectors and feature values of texts on multiple different themes in the concept library means the following: for a given theme, there are two corpora, unsafe text and normal text. The concept feature vectors are counted and the feature values calculated for each corpus separately, and the concept feature vectors and feature values of the unsafe text and the normal text are stored in the positive set and the negative set, respectively, of the corresponding theme in the concept library. Texts on every theme are processed in this way, so that the final concept library covers multiple different themes relating to unsafe text, each with its own positive set and negative set;
Counting the concept feature vectors means counting every pair of words that occur together within a distance of several words: if the distance between two words is no more than several words, the 2-tuple formed by those two words is counted as one concept feature vector. The correlation between concept feature vectors is based on word order, that is, on which word comes before and which comes after;
As for the feature value: because the number of feature vectors is very large and the correlation between individual feature vectors is small, the size of a feature value depends mainly on how often the feature vector occurs and on the frequencies of the two words that make it up. It is directly proportional to the logarithm of the feature vector's frequency and inversely proportional to the frequencies of the two words.
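The counting of concept feature vectors described in step 1 can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `concept_features`, the parameter `max_distance`, and the reading of "distance" as the difference in token positions are all assumptions, since the text only says "several words".

```python
from collections import Counter


def concept_features(tokens, max_distance=3):
    """Count ordered word pairs whose positions differ by at most max_distance.

    Each pair (first, second) keeps the order in which the words appear,
    reflecting the statement that the association depends on word order.
    max_distance is a hypothetical parameter standing in for "several words".
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        # Look ahead at most max_distance positions, without running off the end.
        upper = min(i + 1 + max_distance, len(tokens))
        for j in range(i + 1, upper):
            counts[(first, tokens[j])] += 1
    return counts


pairs = concept_features(["a", "b", "c", "d"], max_distance=2)
print(sorted(pairs))
```

With a window of 2, the pair `("a", "d")` is not counted because the two words are three positions apart.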
Step 2: capture text data through network nodes using the text transfer protocol under HTTP, and pre-process it; the pre-processing comprises word segmentation and removal of meaningless words;
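The pre-processing in step 2 might look roughly like the sketch below. It is a simplified stand-in under stated assumptions: the patent targets Chinese text, which needs a dedicated word segmenter, whereas this sketch splits on word characters with a regular expression; the stop-word list `STOPWORDS` and the function name `preprocess` are hypothetical, and HTML tags are stripped with a naive pattern rather than a real parser.

```python
import re

# Hypothetical stop-word list; a real deployment would use a much larger one.
STOPWORDS = {"the", "a", "of", "and", "is"}


def preprocess(page_source, stopwords=STOPWORDS):
    """Strip markup, split the text into tokens, and drop meaningless words."""
    # Naive tag removal; assumed sufficient for this illustration only.
    text = re.sub(r"<[^>]+>", " ", page_source)
    # Stand-in for word segmentation: lower-case runs of word characters.
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(preprocess("<p>The cat and the hat</p>"))
```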
Step 3: judge whether the theme of the web page text pre-processed in step 2 is a sensitive theme;
Judging whether the text belongs to a sensitive theme means using keywords to judge whether the theme of the pre-processed web page text is related to an unsafe content theme. If it is not, the judgment ends; otherwise, determine which theme in the concept library it belongs to and assign the web page text to that theme. For example, if the text contains keywords such as Falun Gong, it is assigned to the Falun Gong theme section of the concept library, and step 4 is then carried out.
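The keyword-based theme check in step 3 can be illustrated with a short sketch. The function name `match_theme`, the dictionary layout, and the placeholder theme names and keywords are assumptions made for illustration; the text specifies only that keywords decide whether, and to which theme, a text is assigned.

```python
def match_theme(tokens, theme_keywords):
    """Return the first theme whose keyword set overlaps the tokens, else None.

    theme_keywords maps a theme name to a set of keywords (hypothetical layout).
    Returning None corresponds to "the judgment ends" in step 3.
    """
    token_set = set(tokens)
    for theme, keywords in theme_keywords.items():
        if token_set & keywords:
            return theme
    return None


themes = {"theme_a": {"alpha"}, "theme_b": {"beta", "gamma"}}
print(match_theme(["x", "gamma"], themes))
```

Only texts for which a theme is returned go on to the more expensive feature counting of step 4, which is what keeps the keyword check useful as a cheap first pass.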
Step 4: if the theme of the web page text is a sensitive theme, determine which theme its content belongs to, then count the concept feature vectors and calculate the feature values by the same method as in step 1;
Step 5: compare the concept feature vectors and feature values calculated for the web page text in step 4 with the feature vectors and feature values of the corresponding theme in the concept library by similarity calculation, using a VSM, SVM, or KNN classification method to find the class with the greatest similarity to the web page text, and assign the web page text to that class. The nature of the text follows from the nature of the class: if the class is one that must be filtered, the text should be filtered and is passed on to step 6;
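Of the three classifiers named in step 5, the vector space model (VSM) is the simplest to sketch. The example below assumes that each class in the concept library is summarised as a sparse vector mapping a word pair to its feature value, and that similarity is cosine similarity; neither detail is stated in the text, and the names `cosine` and `classify` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(doc_vector, class_vectors):
    """Return the label of the class vector most similar to doc_vector."""
    return max(class_vectors, key=lambda label: cosine(doc_vector, class_vectors[label]))


library = {
    "positive": {("a", "b"): 0.5, ("c", "d"): 0.1},
    "negative": {("c", "d"): 1.0},
}
print(classify({("a", "b"): 1.0}, library))
```

An SVM or KNN classifier would replace `classify` with a trained model, while the sparse pair-to-value representation could stay the same.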
Step 6: according to the filtering result, place the web page address in the supervision database; the administrator then places the web page behind the firewall, so that all requests to access it are blocked automatically, thereby achieving web page information supervision.
Compared with the prior art, the present invention has the following beneficial effects: it provides a new information filtering method that replaces traditional keyword feature vectors with concept feature vectors. On the one hand this improves the filtering effect: common classification algorithms have an accuracy of about 80%, whereas the accuracy of the present invention is about 92%. On the other hand, because the number of feature vectors is reduced, filtering is faster, and about 500 articles can be processed per second.
Description of drawings
Fig. 1 is a workflow diagram of the present invention.
Embodiment
An embodiment of the invention is described in detail below with reference to the accompanying drawing. This embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and a concrete operating process are given, but the scope of protection of the invention is not limited to the following embodiment.
This embodiment filters and supervises unsafe web pages that contain Falun Gong information.
As shown in Fig. 1, this embodiment comprises the following steps:
1. Collect two classes of text on the Falun Gong theme, positive and negative, that is, texts advocating Falun Gong and texts attacking it; then calculate their feature vectors and feature values and place them in the positive and negative parts of the concept library for the Falun Gong theme.
Counting the concept feature vectors means counting every pair of words that occur together within a distance of N words: if the distance between two words is no more than N words, the 2-tuple formed by those two words is counted as one concept feature vector. The correlation between concept feature vectors is based on word order, that is, on which word comes before and which comes after. For example, for "Falun Gong harms society", "Falun Gong harms" and "harms society" are feature vectors;
Calculating the feature value means the following: the feature value expresses the correlation between two phrases. If N1 is the frequency of phrase 1, N2 is the frequency of phrase 2, and N is the frequency with which phrase 1 and phrase 2 occur together (a count, distinct from the word distance N above), the feature value is log(N)/(N1+N2). The feature value expresses the contribution of a feature vector to classification; the larger the feature value, the better its discriminating power.
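The feature value formula log(N)/(N1+N2) can be written out as a small function. This is a sketch under assumptions: the text does not give the base of the logarithm, so the natural logarithm is used here, and the function name `feature_value` and its parameter names are hypothetical. A pair that was never observed is given the value 0 so that the logarithm is never taken of zero.

```python
import math


def feature_value(pair_count, first_count, second_count):
    """Compute log(N) / (N1 + N2) for one concept feature vector.

    pair_count   -- N, how often the two phrases occur together
    first_count  -- N1, frequency of phrase 1
    second_count -- N2, frequency of phrase 2
    The natural logarithm is an assumption; the source does not state the base.
    """
    if pair_count <= 0:
        return 0.0  # assumed convention for an unseen pair
    return math.log(pair_count) / (first_count + second_count)


print(feature_value(10, 20, 30))
```

Note that under this formula a pair seen exactly once has a feature value of 0, because log(1) = 0, so such pairs contribute nothing to classification.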
2. Capture text data through network nodes using the text transfer protocol under HTTP; then apply format conversion and encoding conversion to the captured text data according to the encoding and format conversion scheme, perform word segmentation, and remove meaningless words;
3. Use keyword matching to judge whether the web page text belongs to the Falun Gong sensitive theme. The keywords for sensitive themes are defined by the administrator and stored in a database, so whether a text belongs to a sensitive theme can be determined by querying the database. If it does not, the judgment ends; if it does, the following processing is carried out;
4. Count the feature vectors of the text and calculate the corresponding feature values, and determine which topic the content belongs to; for example, a text containing wording such as Falun Gong belongs to the Falun Gong topic;
5. Look up in the concept library the feature values of the corresponding feature vectors for this topic. The topic has two classes of feature vectors and feature values in the concept library, representing the positive and negative attitudes toward the topic. A VSM, SVM, or KNN classification method is then used to find the class most similar to the text being classified; if that class is the Falun Gong propaganda class, the text is judged to be an article of Falun Gong propaganda. The classification result is used for supervision, and the next step is carried out;
6. According to the filtering result, place the web page address in the supervision database; the administrator then places the web page behind the firewall, so that all requests to access it are blocked automatically, thereby achieving web page information supervision.
Compared with the prior art, this embodiment has the following beneficial effects: it provides a new information filtering method that replaces traditional keyword feature vectors with concept feature vectors. The accuracy of this embodiment is about 92%, and it can process about 500 articles per second.

Claims (5)

1. A filtering and supervision method for unsafe web page text, characterized in that it comprises the following steps:
Step 1: for texts on different themes, count the concept feature vectors, calculate the feature values, and store the concept feature vectors and feature values of the texts on the multiple different themes in a concept library;
Step 2: capture text data through network nodes using the text transfer protocol under HTTP, and pre-process it; the pre-processing comprises word segmentation and removal of meaningless words;
Step 3: judge whether the theme of the web page text pre-processed in step 2 is a sensitive theme;
Step 4: if the theme of the web page text is a sensitive theme, determine which theme its content belongs to, then count the concept feature vectors and calculate the feature values by the same method as in step 1;
Step 5: compare the concept feature vectors and feature values calculated for the web page text in step 4 with the feature vectors and feature values of the corresponding theme in the concept library by similarity calculation, using a VSM, SVM, or KNN classification method to find the class with the greatest similarity to the web page text, and assign the web page text to that class; the nature of the text follows from the nature of the class, and if the class is one that must be filtered, the text should be filtered and is passed on to step 6;
Step 6: according to the filtering result, place the web page address in the supervision database; the administrator then places the web page behind the firewall, so that all requests to access it are blocked automatically, thereby achieving web page information supervision.
2. The filtering and supervision method for unsafe web page text according to claim 1, characterized in that storing the concept feature vectors and feature values of texts on multiple different themes in the concept library means: for a given theme, there are two corpora, unsafe text and normal text; the concept feature vectors are counted and the feature values calculated for each corpus separately, and the concept feature vectors and feature values of the unsafe text and the normal text are stored in the positive set and the negative set, respectively, of the corresponding theme in the concept library; texts on every theme are processed in this way, so that the final concept library covers multiple different themes relating to unsafe text, each with its own positive set and negative set.
3. The filtering and supervision method for unsafe web page text according to claim 1 or 2, characterized in that counting the concept feature vectors means counting every pair of words that occur together within a distance of several words: if the distance between two words is no more than several words, the 2-tuple formed by those two words is counted as one concept feature vector, and the correlation between concept feature vectors is based on word order, that is, on which word comes before and which comes after.
4. The filtering and supervision method for unsafe web page text according to claim 1 or 2, characterized in that the feature value depends on how often the feature vector occurs and on the frequencies of the two words that make it up; it is directly proportional to the logarithm of the feature vector's frequency and inversely proportional to the frequencies of the two words.
5. The filtering and supervision method for unsafe web page text according to claim 1, characterized in that judging whether the text belongs to a sensitive theme means using keywords to judge whether the theme of the pre-processed web page text is related to an unsafe content theme; if it is not, the judgment ends; otherwise, determine which theme in the concept library it belongs to and assign the web page text to that theme.
CN201110083908XA 2011-03-30 2011-03-30 Filter supervision method specific to unsecure web page texts Pending CN102117339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110083908XA CN102117339A (en) 2011-03-30 2011-03-30 Filter supervision method specific to unsecure web page texts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110083908XA CN102117339A (en) 2011-03-30 2011-03-30 Filter supervision method specific to unsecure web page texts

Publications (1)

Publication Number Publication Date
CN102117339A true CN102117339A (en) 2011-07-06

Family

ID=44216109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110083908XA Pending CN102117339A (en) 2011-03-30 2011-03-30 Filter supervision method specific to unsecure web page texts

Country Status (1)

Country Link
CN (1) CN102117339A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102542063A (en) * 2011-12-30 2012-07-04 华为技术有限公司 Content filtering method, device and system
CN102693279A (en) * 2012-04-28 2012-09-26 合一网络技术(北京)有限公司 Method, device and system for fast calculating comment similarity
CN103501322A (en) * 2013-09-24 2014-01-08 长沙裕邦软件开发有限公司 Website sharing platform and method for achieving website sharing
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
CN107357801A (en) * 2017-05-18 2017-11-17 辛柯俊 A kind of enterprise's related web page theme measuring method and system
CN107679075A (en) * 2017-08-25 2018-02-09 北京德塔精要信息技术有限公司 Method for monitoring network and equipment
CN107688576A (en) * 2016-08-04 2018-02-13 中国科学院声学研究所 The structure and tendentiousness sorting technique of a kind of CNN SVM models
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN110209796A (en) * 2019-04-29 2019-09-06 北京印刷学院 A kind of sensitive word detection filter method, device and electronic equipment
CN110457428A (en) * 2019-06-26 2019-11-15 北京印刷学院 A kind of sensitive word detection filter method, device and electronic equipment
CN111563276A (en) * 2019-01-25 2020-08-21 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN116502009A (en) * 2023-06-25 2023-07-28 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013097597A1 (en) * 2011-12-30 2013-07-04 华为技术有限公司 Content filtering method, device and system
CN102542063B (en) * 2011-12-30 2015-04-29 华为技术有限公司 Content filtering method, device and system
CN102542063A (en) * 2011-12-30 2012-07-04 华为技术有限公司 Content filtering method, device and system
CN102693279A (en) * 2012-04-28 2012-09-26 合一网络技术(北京)有限公司 Method, device and system for fast calculating comment similarity
CN102693279B (en) * 2012-04-28 2014-09-03 合一网络技术(北京)有限公司 Method, device and system for fast calculating comment similarity
CN103902619B (en) * 2012-12-28 2018-10-23 中国移动通信集团公司 A kind of network public-opinion monitoring method and system
CN103902619A (en) * 2012-12-28 2014-07-02 中国移动通信集团公司 Internet public opinion monitoring method and system
CN103501322A (en) * 2013-09-24 2014-01-08 长沙裕邦软件开发有限公司 Website sharing platform and method for achieving website sharing
CN107688576B (en) * 2016-08-04 2020-06-16 中国科学院声学研究所 Construction and tendency classification method of CNN-SVM model
CN107688576A (en) * 2016-08-04 2018-02-13 中国科学院声学研究所 The structure and tendentiousness sorting technique of a kind of CNN SVM models
CN107357801B (en) * 2017-05-18 2021-05-28 辛柯俊 Enterprise related webpage theme measuring method and system
CN107357801A (en) * 2017-05-18 2017-11-17 辛柯俊 A kind of enterprise's related web page theme measuring method and system
CN107679075A (en) * 2017-08-25 2018-02-09 北京德塔精要信息技术有限公司 Method for monitoring network and equipment
CN107679075B (en) * 2017-08-25 2020-06-02 北京德塔精要信息技术有限公司 Network monitoring method and equipment
CN108153872A (en) * 2017-12-25 2018-06-12 佛山市车品匠汽车用品有限公司 A kind of method and apparatus of the Internet web page information filtering
CN111563276B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN111563276A (en) * 2019-01-25 2020-08-21 深信服科技股份有限公司 Webpage tampering detection method, detection system and related equipment
CN110209796A (en) * 2019-04-29 2019-09-06 北京印刷学院 A kind of sensitive word detection filter method, device and electronic equipment
CN110209796B (en) * 2019-04-29 2022-02-08 北京印刷学院 Sensitive word detection and filtering method and device and electronic equipment
CN110457428B (en) * 2019-06-26 2023-07-04 北京印刷学院 Sensitive word detection and filtering method and device and electronic equipment
CN110457428A (en) * 2019-06-26 2019-11-15 北京印刷学院 A kind of sensitive word detection filter method, device and electronic equipment
CN116502009A (en) * 2023-06-25 2023-07-28 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium
CN116502009B (en) * 2023-06-25 2023-10-31 北京奇虎科技有限公司 Webpage filtering method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102117339A (en) Filter supervision method specific to unsecure web page texts
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
CN103336766B (en) Short text garbage identification and modeling method and device
US9164980B2 (en) Name identification rule generating apparatus and name identification rule generating method
CN103324745B (en) Text garbage recognition methods and system based on Bayesian model
WO2016058267A1 (en) Chinese website classification method and system based on characteristic analysis of website homepage
WO2014056397A1 (en) Label of interest recommendation method, system and computer readable medium
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN101231661A (en) Method and system for digging object grade knowledge
CN111324801B (en) Hot event discovery method in judicial field based on hot words
CN102629282A (en) Website classification method, device and system
Man Feature extension for short text categorization using frequent term sets
CN108846117A (en) The duplicate removal screening technique and device of business news flash
CN104899324A (en) Sample training system based on IDC (internet data center) harmful information monitoring system
CN112492606B (en) Classification recognition method and device for spam messages, computer equipment and storage medium
CN104268230A (en) Method for detecting objective points of Chinese micro-blogs based on heterogeneous graph random walk
WO2015014221A1 (en) Trash information filtering method and device
CN110020161B (en) Data processing method, log processing method and terminal
CN104636386A (en) Information monitoring method and device
CN102722526B (en) Part-of-speech classification statistics-based duplicate webpage and approximate webpage identification method
CN111460803B (en) Equipment identification method based on Web management page of industrial Internet of things equipment
US8108391B1 (en) Identifying non-compositional compounds
Zhang et al. A hot spot clustering method based on improved kmeans algorithm
JP5477910B2 (en) Text search program, device, server and method using search keyword dictionary and dependency keyword dictionary
CN109063117B (en) Network security blog classification method and system based on feature extraction

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20110706