CN101055621A - Content based sensitive web page identification method - Google Patents
- Publication number
- CN101055621A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- identification
- sensitive
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Character Input (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a content-based method for identifying sensitive web pages, comprising the following steps: given the uniform resource locator (URL) of a web page, the source code of the page is obtained, the data are split into streams and preprocessed, and the text information and the effective image information are obtained. The text information is processed with a continuous sensitive text classifier; if the classifier output exceeds a preset threshold, the page is identified as sensitive and processing ends. Otherwise, the text information is processed with a discrete sensitive text classifier; if that classifier's output exceeds a preset threshold, the page is identified as sensitive and processing ends. Otherwise, the images are identified with an image classifier, and the identification results are fused with the output of the discrete text classifier. By combining the continuous sensitive text classifier, the discrete sensitive text classifier, and the sensitive image classifier, the invention addresses the problems of the prior art. By using web structural information and formulating an identification problem over a set of images, the invention fuses information to improve the recognition rate for sensitive web pages.
Description
Technical field
The present invention relates to the technical field of information filtering, and in particular to a method for identifying web pages that contain sensitive information.
Background art
Sensitive information on the Internet has caused great harm to Internet users, especially teenagers, and has therefore attracted wide attention from researchers and industry.
Many methods for filtering sensitive information exist at present, including blacklists and whitelists, IP filtering, and keyword matching. On the one hand, these filtering techniques work in a very mechanical way: they can reach a filtering rate of 100% for some sensitive web pages with very short response times, but the filtering parameters can only be updated after actual sensitive web pages have appeared, so they cannot keep up with the rapid changes of real sensitive websites. On the other hand, because the content of the web page is used little or not at all, these techniques have a very high false-filtering rate, which interferes with users' normal web access.
Content-based intelligent identification of sensitive information has been a direction of development for filtering technology in recent years. A number of content-based methods for identifying sensitive information already exist.
Current sensitive web page identification methods are generally built on sensitive text identification, so their core is text processing: the text of the web page is extracted first, features are then extracted from it, and a classification algorithm from machine learning is used to train on and classify the features. Feature extraction usually proceeds as follows: (1) a keyword list is specified manually; (2) text matching is used to count the number of occurrences of each keyword; (3) the occurrence counts form a vector which, after processing such as normalization, serves as the feature vector of the text. The number of specified keywords is generally less than 100. A classifier is then chosen for training and prediction. Pui Y. Lee et al. of Singapore used a Kohonen self-organizing neural network as the classifier and obtained good practical results. There are also sensitive image recognition methods; for example, our institution has proposed a content-based sensitive image recognition method that achieved a recognition rate above 80% on the CAMPAQ database.
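The three-stage keyword feature extraction described above can be sketched as follows. This is a minimal illustration only: the function name `keyword_feature_vector`, the choice of L1 normalization, and the placeholder keywords are assumptions, since the text says only that the counts are normalized.

```python
def keyword_feature_vector(text, keywords):
    """Count each keyword in the text and normalise the counts to sum to 1.

    Substring matching (str.count) is used so that the sketch also works
    for unsegmented Chinese text; the L1 normalisation is an assumption.
    """
    counts = [text.count(kw) for kw in keywords]  # steps (1) and (2)
    total = sum(counts)
    if total == 0:
        # No keyword occurs: return an all-zero vector instead of dividing by 0.
        return [0.0] * len(keywords)
    return [c / total for c in counts]            # step (3)


# Placeholder keywords stand in for a manually specified keyword list.
vec = keyword_feature_vector("alpha beta alpha gamma", ["alpha", "beta", "delta"])
```

The resulting vector has one component per keyword, so with fewer than 100 keywords the classifier works in a low-dimensional feature space.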
Like mechanical filtering methods, the above methods do not make good use of the characteristics of the web and cannot yet achieve satisfactory results. For example, text-based sensitive web page identification cannot reliably recognize normal web pages that are related to sensitive topics, and image-based sensitive web page identification has a very high false recognition rate. Existing fusion algorithms merely combine results with AND or OR operations and cannot fundamentally improve the recognition rate.
To overcome the deficiencies of the prior art, the object of the present invention is to identify sensitive information on the basis of the characteristics of web pages and to further improve the recognition rate for sensitive web pages. To this end, the present invention proposes a content-based sensitive web page identification method.
To achieve the above object, the content-based sensitive web page identification method of the present invention comprises a preprocessing step and a sensitive information identification step.
The preprocessing step comprises:
Given the uniform resource locator of a web page, obtaining the source code of the web page, splitting the data into streams and preprocessing it, and obtaining the text information;
Obtaining the structural information of the image portion of the web page and selecting the important images to form an effective image set.
The sensitive information identification step comprises:
A step of identifying the text information with a continuous sensitive text recognizer;
A step of identifying the text information with a discrete text recognizer;
A step of identifying the images of the image set with a sensitive image recognizer.
The sensitive information identification step proceeds as follows:
The text information is identified with the continuous sensitive text recognizer; if the result is sensitive, processing ends; if the result is not sensitive, the following is carried out:
The discrete text recognizer identifies the text information; if the recognizer output is greater than a threshold, the result is sensitive and processing ends; if the result is not sensitive, the following is carried out:
The sensitive image recognizer identifies the images of the image set; the identification result is fused with the result of the discrete sensitive text recognizer, and whether the web page is sensitive is judged from the fusion result.
The present invention addresses two problems of the prior art: text-based sensitive web page identification cannot reliably recognize normal web pages related to sensitive topics, and schemes that fuse image-based identification results through AND or OR operations cannot fundamentally improve the recognition rate. The present invention solves these problems by combining three components: the continuous sensitive text recognizer, the discrete text recognizer, and the sensitive image recognizer. It uses web structural information and formulates an identification problem over a set of images in order to fuse information, improving the recognition rate for sensitive web pages.
Brief description of the drawings
The above and other aspects, features, and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawing, in which:
Fig. 1 is a schematic diagram of the system framework of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawing. It should be noted that the described embodiments are for illustrative purposes only and do not limit the present invention.
Fig. 1 is a schematic diagram of the system framework of the present invention; the specific steps are as follows:
Step S1: obtain the source code for the given web page URL;
Step S2: separate the Chinese text from the source code;
Step S3: obtain the size information of the images in the source code and discard some of the images according to rules;
Step S4: identify the separated Chinese text with the continuous text classifier; if the result is 1, the web page is sensitive and processing exits;
Step S5: identify the Chinese text with the discrete text classifier; if the result is greater than the set threshold, the web page is sensitive and processing exits;
Step S6: identify the images with the image classifier;
Step S7: fuse the image identification result with the result of the discrete text identification.
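The early-exit cascade of steps S4 to S7 can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the function `classify_page` and its parameters are hypothetical, and the three classifiers and the fusion rule are passed in as plain callables so that the control flow stays independent of how each one is built.

```python
from typing import Callable, Sequence


def classify_page(
    text: str,
    images: Sequence[object],
    continuous_clf: Callable[[str], int],     # returns 1 for sensitive text (S4)
    discrete_clf: Callable[[str], float],     # returns a sensitivity probability (S5)
    image_clf: Callable[[object], bool],      # True if an image is sensitive (S6)
    fuse: Callable[[float, int, int], bool],  # fusion rule over (P_s, N_1, N_2) (S7)
    tau: float,
) -> bool:
    """Return True if the page is judged sensitive."""
    # S4: the continuous text classifier decides first; output 1 means sensitive.
    if continuous_clf(text) == 1:
        return True
    # S5: the discrete text classifier yields a probability compared with tau.
    p_s = discrete_clf(text)
    if p_s > tau:
        return True
    # S6: count how many effective images are identified as sensitive or normal.
    n1 = sum(1 for img in images if image_clf(img))
    n2 = len(images) - n1
    # S7: fuse the image counts with the discrete text result.
    return fuse(p_s, n1, n2)


# Hypothetical stand-ins, used only to exercise the control flow.
result = classify_page(
    "some text", ["a", "b"],
    continuous_clf=lambda t: 0,
    discrete_clf=lambda t: 0.3,
    image_clf=lambda img: img == "a",
    fuse=lambda p, n1, n2: n1 > n2,
    tau=0.5,
)
```

Ordering the stages this way means the image classifier, typically the most expensive stage, runs only on pages that neither text classifier has already flagged.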
In step S3, selecting the important images comprises:
Obtaining the size information of every image contained in the web page;
If the size of an image satisfies rules derived in advance from statistics, the image is regarded as important and is placed in the effective image set.
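A size-based filter of this kind might look like the sketch below. The text does not state the statistical rules, so the thresholds `min_width`, `min_height`, and `max_aspect`, as well as the function name and the sample file names, are hypothetical values chosen purely for illustration.

```python
def select_effective_images(images, min_width=100, min_height=100, max_aspect=3.0):
    """Keep images whose declared size passes simple rules.

    `images` is a list of (name, width, height) tuples. The thresholds are
    hypothetical; the source says only that the rules come from prior statistics.
    """
    effective = []
    for name, width, height in images:
        # Small images are likely icons, buttons, or spacers.
        if width < min_width or height < min_height:
            continue
        # Very elongated images are likely banners or separators.
        aspect = max(width, height) / min(width, height)
        if aspect > max_aspect:
            continue
        effective.append(name)
    return effective


page_images = [("logo.gif", 88, 31), ("photo.jpg", 640, 480), ("banner.png", 728, 90)]
kept = select_effective_images(page_images)
```

Because only the declared dimensions are inspected, the filter can run on the page source alone, before any image is downloaded or classified.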
In step S4, identifying the text with the continuous sensitive text recognizer comprises:
Extracting the features of the text;
Inputting the text features into a support vector machine (SVM) that has been trained in advance; if the output is 1, the text is sensitive and processing ends; otherwise processing continues.
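The decision made by an already trained SVM can be sketched for the linear case as below. This is an assumption-laden illustration: the text does not say which kernel is used, the weights and bias shown are made-up numbers standing in for a trained model, and the label 0 for non-sensitive text is a convention chosen here (the text specifies only that an output of 1 means sensitive).

```python
def svm_predict(features, weights, bias):
    """Linear SVM decision: 1 (sensitive) if w.x + b > 0, else 0.

    `weights` and `bias` stand in for a model trained beforehand.
    """
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0


# Made-up parameters, purely for illustration.
label_a = svm_predict([0.5, 0.5], weights=[2.0, -1.0], bias=-0.25)
label_b = svm_predict([0.0, 1.0], weights=[2.0, -1.0], bias=-0.25)
```

Only the first case crosses the decision boundary, so in the cascade it would end processing at step S4, while the second would continue to step S5.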
In step S5, identifying the text with the discrete sensitive text recognizer comprises:
Extracting the features of the text with the vector space model (VSM);
Inputting the text features into a trained Bayesian network (Bayes Networks, BNS); the output is the probability that the input text is sensitive; if the probability is greater than the threshold τ, the text is sensitive and processing ends; otherwise processing continues.
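The probability-and-threshold decision can be illustrated with a naive Bayes model, which is swapped in here as the simplest Bayesian classifier; the patent's Bayesian network may have a different structure. The function names, the per-keyword likelihoods, and the prior are all hypothetical.

```python
import math


def sensitive_probability(counts, prior, p_word_sensitive, p_word_normal):
    """Posterior probability that a text is sensitive under naive Bayes.

    `counts` is the keyword count vector (the VSM features); the likelihood
    lists give each keyword's probability under the two classes and must be
    strictly between 0 and 1, as must `prior`.
    """
    log_s = math.log(prior)
    log_n = math.log(1.0 - prior)
    for c, ps, pn in zip(counts, p_word_sensitive, p_word_normal):
        log_s += c * math.log(ps)
        log_n += c * math.log(pn)
    # Normalise in log space to avoid underflow on long texts.
    m = max(log_s, log_n)
    es, en = math.exp(log_s - m), math.exp(log_n - m)
    return es / (es + en)


def is_sensitive(counts, tau, prior, p_word_sensitive, p_word_normal):
    """Apply the threshold tau to the posterior probability."""
    return sensitive_probability(counts, prior, p_word_sensitive, p_word_normal) > tau


p = sensitive_probability([2, 0], prior=0.5,
                          p_word_sensitive=[0.8, 0.2], p_word_normal=[0.2, 0.8])
```

Unlike the SVM stage, this stage yields a graded probability, which is what allows it to be reused later as prior knowledge in the fusion step.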
In step S6, the image identification comprises:
Identifying every image with the image recognizer; the number of images identified as sensitive is $N_1$, and the number of images identified as normal is $N_2$.
In step S7, the information fusion comprises:
Fusing the result of the discrete text identification with the image identification result of step S6: the identification results are substituted into formula (1-1); if the result is greater than 1, the web page is sensitive; otherwise it is normal, and processing ends.
In steps S1 and S2 of the method, web pages are divided into three classes on the basis of an analysis of the web. The first class is web pages based on continuous text, where continuous text is defined as text of an article-like nature: there is strong semantic association between contexts, and rich semantic information is available. Pages of this type usually contain one or several articles. The second class is web pages based on discrete text, where discrete text means any text other than continuous text, for example the text of a home page or the captions around pictures, which mainly serves as links or explanation. The third class is web pages based on images: the page mainly presents image information, together with a small amount of discrete text.
Specifically, for web pages of the first class, in which continuous text dominates, the present invention uses a filtering method that combines semantics and statistics. Three classes of keywords are defined, with the following descriptive definitions:
The first class is explicit keywords. These keywords essentially appear only in sensitive text; statistically, the probability that they appear in sensitive text is very high (close to 1), and the probability that they appear in normal text is very low (close to 0). Semantically, these words themselves carry sensitive information.
The second class is implicit keywords. These keywords do not themselves carry any sensitive information, but for some reason they have formed a fixed association with sensitive text; that is, they also appear in sensitive text with high probability, although they can of course appear in other text as well.
The third class is logic keywords, which fall into two kinds. The first kind is polysemous words, whose meaning is normal in normal text but which carry sensitive information in sensitive text. The second kind carries sensitive information only jointly, after being combined with certain other words. Such combinations can be of two forms: explicit-plus-logic and logic-plus-logic.
Based on the above definitions, a keyword set is chosen, and semantic rules are built to describe the semantic associations between words, which helps to extract feature information correctly. The extracted features, after normalization, form the feature vector of the continuous text. In step S4, a support vector machine (SVM) is used as the classifier to train on and classify the features, and the SVM output determines whether the web page is sensitive.
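One way to read the three keyword classes as a feature extractor is sketched below. The rule used here is an assumption made for illustration: a logic keyword is counted only when the same sentence also contains an explicit keyword or a second, different logic keyword. The function name, the sentence splitting, and the neutral placeholder words are likewise hypothetical.

```python
import re


def semantic_keyword_counts(text, explicit, implicit, logic):
    """Count keyword hits per class, applying a co-occurrence rule to logic keywords."""
    counts = {"explicit": 0, "implicit": 0, "logic": 0}
    # Split on Western and Chinese sentence-ending punctuation.
    for sentence in re.split(r"[.!?。！？]", text):
        exp_hits = sum(sentence.count(k) for k in explicit)
        imp_hits = sum(sentence.count(k) for k in implicit)
        logic_present = [k for k in logic if k in sentence]
        counts["explicit"] += exp_hits
        counts["implicit"] += imp_hits
        # Assumed semantic rule: logic keywords count only in an
        # explicit-plus-logic or logic-plus-logic combination.
        if exp_hits > 0 or len(logic_present) >= 2:
            counts["logic"] += sum(sentence.count(k) for k in logic_present)
    return counts


# Neutral placeholder vocabulary, used only to show the rule at work.
counts = semantic_keyword_counts(
    "red apple. green pear and red plum. red sky.",
    explicit=["apple"], implicit=["pear"], logic=["red", "green"],
)
```

In the last sentence the logic keyword appears alone, so it is ignored, which is how a rule of this kind keeps polysemous words in normal text from inflating the feature vector.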
Specifically, for web pages of the second type, step S5 applies: a keyword list is constructed manually; the keywords in the text of the web page are counted, and the counts, after normalization, are input as a feature vector into the trained Bayes network; the output of the network determines whether the web page is sensitive.
Specifically, for web pages of the third type, step S3 obtains the images in the web page that meet the size requirements. In step S6, the image classifier identifies the images one by one, giving the result $(N_1, N_2)$, where $N_1$ is the number of images identified as sensitive and $N_2$ is the number of images identified as normal. At the same time, to supply prior knowledge of whether the images are sensitive, step S5 applies the Bayes classifier for discrete text to the text of the web page; its output is $P_s$. In step S7, two parameters describe the image classifier: $P_1$ is the probability that a normal image is misclassified as sensitive, and $P_2$ is the probability that a sensitive image is misclassified as normal. These quantities are substituted into formula (1-1) for fusion.
The output values of the classifiers are substituted into the formula, and the computed result is compared with a threshold to judge whether the web page is sensitive.
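The fusion formula itself does not appear in this text, so the sketch below should be read as one plausible form and not as the patent's formula (1-1). It assumes a posterior-odds combination that is consistent with the "greater than 1" decision rule: the text classifier's output acts as the prior, every image on a sensitive page is assumed to be sensitive and every image on a normal page normal, and the image decisions are assumed independent. The function names are hypothetical.

```python
def fused_odds(p_s, n1, n2, p1, p2):
    """Assumed posterior odds that a page is sensitive.

    p_s : discrete text classifier output, used as prior (0 < p_s < 1)
    n1  : number of images identified as sensitive
    n2  : number of images identified as normal
    p1  : P(normal image misclassified as sensitive), 0 < p1 < 1
    p2  : P(sensitive image misclassified as normal), 0 < p2 < 1
    """
    prior_odds = p_s / (1.0 - p_s)
    # Likelihood ratio contributed by each image identified as sensitive.
    lr_sensitive = (1.0 - p2) / p1
    # Likelihood ratio contributed by each image identified as normal.
    lr_normal = p2 / (1.0 - p1)
    return prior_odds * lr_sensitive ** n1 * lr_normal ** n2


def page_is_sensitive(p_s, n1, n2, p1, p2):
    """Judge the page sensitive when the assumed odds exceed 1."""
    return fused_odds(p_s, n1, n2, p1, p2) > 1.0


odds = fused_odds(0.4, n1=3, n2=1, p1=0.2, p2=0.2)
```

Under this assumed form, a text result below the threshold can still be overturned when enough images are identified as sensitive, which a plain AND or OR of the two decisions cannot express.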
In the above description, each step is an example. Those of ordinary skill in the art can determine the steps actually used according to actual conditions, and each step can be implemented in several ways, all of which fall within the scope of the present invention.
Finally, the above description serves to implement the present invention and its embodiments, and the scope of the present invention should not be limited by this description. Those skilled in the art should understand that any modification or partial replacement that does not depart from the scope of the present invention falls within the scope defined by the claims of the present invention.
Claims (6)
1. A content-based sensitive web page identification method, comprising:
A preprocessing step, comprising:
Given the uniform resource locator of a web page, obtaining the source code of the web page, splitting the data into streams and preprocessing it, and obtaining the text information;
Obtaining the structural information of the image portion of the web page and selecting the important images to form an effective image set;
A web page sensitive information identification step, comprising:
A step of identifying the text information with a continuous sensitive text recognizer;
A step of identifying the text information with a discrete text recognizer;
A step of identifying the images of the image set with a sensitive image recognizer.
2. The content-based sensitive web page identification method according to claim 1, characterized in that the sensitive information identification step proceeds as follows:
The text information is identified with the continuous sensitive text recognizer; if the result is sensitive, processing ends; if the result is not sensitive, the following is carried out:
The discrete text recognizer identifies the text information; if the recognizer output is greater than a threshold, the result is sensitive and processing ends; if the result is not sensitive, the following is carried out:
The sensitive image recognizer identifies the images of the image set; the identification result is fused with the result of the discrete sensitive text recognizer, and whether the web page is sensitive is judged from the fusion result.
3. The content-based sensitive web page identification method according to claim 1, characterized in that selecting the important images comprises:
Obtaining the size information of every image contained in the web page;
If the size of an image satisfies rules derived in advance from statistics, regarding the image as important and placing it in the effective image set.
4. The content-based sensitive web page identification method according to claim 1, characterized in that identifying the text with the continuous sensitive text recognizer comprises:
Extracting the features of the text;
Inputting the text features into a support vector machine that has been trained in advance; if the output is 1, the text is sensitive and processing ends; otherwise processing continues.
5. The content-based sensitive web page identification method according to claim 1, characterized in that identifying the text with the discrete sensitive text recognizer comprises:
Extracting the features of the text with the vector space model;
Inputting the text features into a trained Bayesian network; the output is the probability that the input text is sensitive; if the probability is greater than the threshold τ, the text is sensitive and processing ends; otherwise processing continues.
6. The content-based sensitive web page identification method according to claim 1, characterized in that the image identification and information fusion steps comprise:
Identifying every image with the image recognizer; the number of images identified as sensitive is $N_1$, and the number of images identified as normal is $N_2$;
Fusing the result of the discrete text identification with the above image identification result; if the result is greater than 1, the web page is sensitive; otherwise it is normal, and processing ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2006100731727A CN100412888C (en) | 2006-04-10 | 2006-04-10 | Content based sensitive web page identification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101055621A (en) | 2007-10-17 |
CN100412888C (en) | 2008-08-20 |
Family
ID=38795454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2006100731727A Active CN100412888C (en) | 2006-04-10 | 2006-04-10 | Content based sensitive web page identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100412888C (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102332028A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Webpage-oriented unhealthy Web content identifying method |
CN102541913A (en) * | 2010-12-15 | 2012-07-04 | 中国人民解放军国防科学技术大学 | Web-oriented VSM (vector space model) classifier training method, web-oriented OSSP (open resource software page) identifying method and Web-oriented OSS (open resource software) resource extracting method |
CN101763502B (en) * | 2008-12-24 | 2012-07-25 | 中国科学院自动化研究所 | High-efficiency method and system for sensitive image detection |
CN102647416A (en) * | 2012-03-30 | 2012-08-22 | 上海明复信息技术有限公司 | System and method for filtering harmful information based on internet data source control |
CN102693236A (en) * | 2011-03-24 | 2012-09-26 | 苏州风采信息技术有限公司 | Bad information filtering method based on content understanding |
CN103473299A (en) * | 2013-09-06 | 2013-12-25 | 北京锐安科技有限公司 | Website bad likelihood obtaining method and device |
CN103810425A (en) * | 2012-11-13 | 2014-05-21 | 腾讯科技(深圳)有限公司 | Method and device for detecting malicious website |
CN104391860A (en) * | 2014-10-22 | 2015-03-04 | 安一恒通(北京)科技有限公司 | Content type detection method and device |
CN104951802A (en) * | 2015-06-17 | 2015-09-30 | 中国科学院自动化研究所 | Classifier updating method |
CN105162652A (en) * | 2015-08-21 | 2015-12-16 | 成都秋雷科技有限责任公司 | Processing method for webpage browsing |
CN106682694A (en) * | 2016-12-27 | 2017-05-17 | 复旦大学 | Sensitive image identification method based on depth learning |
CN106992922A (en) * | 2014-05-15 | 2017-07-28 | 周奇 | The method of the subjective sensitive information of expression |
CN107943954A (en) * | 2017-11-24 | 2018-04-20 | 杭州安恒信息技术有限公司 | Detection method, device and the electronic equipment of webpage sensitive information |
CN108234392A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | The monitoring method and device of a kind of website |
CN109656141A (en) * | 2019-01-11 | 2019-04-19 | 武汉天喻聚联网络有限公司 | Violation identification and machine behaviour control method, equipment, storage medium based on artificial intelligence technology |
CN109902223A (en) * | 2019-01-14 | 2019-06-18 | 中国科学院信息工程研究所 | A kind of harmful content filter method based on multi-modal information feature |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
CN114021559A (en) * | 2021-11-02 | 2022-02-08 | 航天信息股份有限公司 | Asset confirmation method and device in network security inspection, electronic equipment and medium |
CN114782670A (en) * | 2022-05-11 | 2022-07-22 | 中航信移动科技有限公司 | Multi-mode sensitive information identification method, equipment and medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101359372B (en) * | 2008-09-26 | 2011-05-11 | 腾讯科技(深圳)有限公司 | Training method and device of classifier, method and apparatus for recognising sensitization picture |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1349180A (en) * | 2001-12-03 | 2002-05-15 | 上海交通大学 | Web page server based on content grading |
CN1508755A (en) * | 2002-12-17 | 2004-06-30 | 中国科学院自动化研究所 | Sensitive video-frequency detecting method |
JP3801138B2 (en) * | 2003-01-21 | 2006-07-26 | ブラザー工業株式会社 | Communication system, communication terminal, and communication program |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763502B (en) * | 2008-12-24 | 2012-07-25 | 中国科学院自动化研究所 | High-efficiency method and system for sensitive image detection |
CN102541913A (en) * | 2010-12-15 | 2012-07-04 | 中国人民解放军国防科学技术大学 | Web-oriented VSM (vector space model) classifier training method, web-oriented OSSP (open resource software page) identifying method and Web-oriented OSS (open resource software) resource extracting method |
CN102541913B (en) * | 2010-12-15 | 2017-10-03 | 中国人民解放军国防科学技术大学 | VSM classifier trainings, the identification of the OSSP pages and the OSS Resource Access methods of web oriented |
CN102693236A (en) * | 2011-03-24 | 2012-09-26 | 苏州风采信息技术有限公司 | Bad information filtering method based on content understanding |
CN102332028A (en) * | 2011-10-15 | 2012-01-25 | 西安交通大学 | Webpage-oriented unhealthy Web content identifying method |
CN102647416A (en) * | 2012-03-30 | 2012-08-22 | 上海明复信息技术有限公司 | System and method for filtering harmful information based on internet data source control |
CN103810425A (en) * | 2012-11-13 | 2014-05-21 | 腾讯科技(深圳)有限公司 | Method and device for detecting malicious website |
CN103810425B (en) * | 2012-11-13 | 2015-09-30 | 腾讯科技(深圳)有限公司 | The detection method of malice network address and device |
US9935967B2 (en) | 2012-11-13 | 2018-04-03 | Tencent Technology (Shenzhen) Company Limited | Method and device for detecting malicious URL |
CN103473299A (en) * | 2013-09-06 | 2013-12-25 | 北京锐安科技有限公司 | Website bad likelihood obtaining method and device |
CN103473299B (en) * | 2013-09-06 | 2017-02-08 | 北京锐安科技有限公司 | Website bad likelihood obtaining method and device |
CN106992922A (en) * | 2014-05-15 | 2017-07-28 | 周奇 | The method of the subjective sensitive information of expression |
CN104391860A (en) * | 2014-10-22 | 2015-03-04 | 安一恒通(北京)科技有限公司 | Content type detection method and device |
CN104391860B (en) * | 2014-10-22 | 2018-03-02 | 安一恒通(北京)科技有限公司 | content type detection method and device |
CN104951802A (en) * | 2015-06-17 | 2015-09-30 | 中国科学院自动化研究所 | Classifier updating method |
CN105162652A (en) * | 2015-08-21 | 2015-12-16 | 成都秋雷科技有限责任公司 | Processing method for webpage browsing |
CN108234392A (en) * | 2016-12-14 | 2018-06-29 | 北京国双科技有限公司 | The monitoring method and device of a kind of website |
CN108234392B (en) * | 2016-12-14 | 2021-06-08 | 北京国双科技有限公司 | Website monitoring method and device |
CN106682694A (en) * | 2016-12-27 | 2017-05-17 | 复旦大学 | Sensitive image identification method based on depth learning |
CN107943954A (en) * | 2017-11-24 | 2018-04-20 | 杭州安恒信息技术有限公司 | Detection method, device and the electronic equipment of webpage sensitive information |
CN107943954B (en) * | 2017-11-24 | 2020-07-10 | 杭州安恒信息技术股份有限公司 | Method and device for detecting webpage sensitive information and electronic equipment |
CN109656141A (en) * | 2019-01-11 | 2019-04-19 | 武汉天喻聚联网络有限公司 | Violation identification and machine behaviour control method, equipment, storage medium based on artificial intelligence technology |
CN109902223A (en) * | 2019-01-14 | 2019-06-18 | 中国科学院信息工程研究所 | A kind of harmful content filter method based on multi-modal information feature |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
CN110275958B (en) * | 2019-06-26 | 2021-07-27 | 北京市博汇科技股份有限公司 | Website information identification method and device and electronic equipment |
CN114021559A (en) * | 2021-11-02 | 2022-02-08 | 航天信息股份有限公司 | Asset confirmation method and device in network security inspection, electronic equipment and medium |
CN114782670A (en) * | 2022-05-11 | 2022-07-22 | 中航信移动科技有限公司 | Multi-mode sensitive information identification method, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN100412888C (en) | 2008-08-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |