CN101281521A - Method and system for filtering sensitive web page based on multiple classifier amalgamation - Google Patents

Method and system for filtering sensitive web page based on multiple classifier amalgamation

Info

Publication number
CN101281521A
Authority
CN
China
Prior art keywords
webpage
text
image
information
sensitive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007100651816A
Other languages
Chinese (zh)
Other versions
CN100565523C (en)
Inventor
胡卫明
陈周耀
吴偶
朱明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CNB2007100651816A (patent CN100565523C)
Publication of CN101281521A
Application granted
Publication of CN100565523C
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a system and a method for filtering sensitive web pages based on multi-classifier fusion. The object processed is a web page, and the result is a decision on whether the page contains sensitive content, such as pornographic, reactionary, violent, or other unhealthy Internet content harmful to society. The system comprises a data stream acquisition and preprocessing unit, an image and text stream filtering unit, and an information fusion unit for the image filter and the text filter. Through the cooperation of multiple classifiers, the system acquires the source code of a web page from its URL; separates text from images at the preprocessing stage to obtain text information and effective image information; assigns the input page to one of three modes using a decision tree algorithm; recognizes the page using a continuous text classifier, a discrete sensitive text classifier, and an image classifier; fuses the classifier outputs to compute a discrimination factor; and returns the final result to the browser.

Description

A method and system for filtering sensitive web pages based on multi-classifier fusion
Technical field
The present invention relates to the technical field of information filtering, and in particular to a method for identifying web pages that contain sensitive information.
Background art
Sensitive information on the Internet causes great harm to Internet users, especially teenagers, and has therefore attracted wide attention from researchers and industry.
Many sensitive-information filtering methods exist at present, including black and white lists, IP filtering, and keyword matching. On the one hand, these techniques work in a very mechanical way: they can reach 100% filtering efficiency on some sensitive web pages with a very short response time, but their filtering parameters can only be updated after actual sensitive pages have appeared, so they cannot keep up with the rapid change of real sensitive websites. On the other hand, because they make little or no use of the content of the web page, they have a very high false-filtering rate, which interferes with users' normal browsing.
Content-based intelligent recognition of sensitive information has been a direction of development for filtering technology in recent years, and several content-based sensitive-information recognition methods now exist.
Current sensitive web page recognition methods are generally built on sensitive text recognition. Their core is therefore text processing: the text in the web page is extracted first, features are then extracted, and a machine-learning classification algorithm is then used to train on and classify the features. Feature extraction usually proceeds as follows: (1) a keyword list is specified manually; (2) text matching is used to count the number of occurrences of each keyword; (3) the occurrence counts form a vector which, after processing such as normalization, serves as the feature vector of the text. The number of keywords specified is generally less than 100. A classifier is then chosen for training and prediction. Pui Y. Lee et al. of Singapore used a Kohonen self-organizing neural network as the classifier and obtained good practical results. There are also sensitive image recognition methods: Yang Jinfeng et al. of the Institute of Automation, Chinese Academy of Sciences, proposed a content-based sensitive image recognition method that achieved a recognition rate of over 80% on the CAMPAQ database.
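The three-part keyword feature extraction described above can be illustrated with a minimal Python sketch. The function name `keyword_features` and the choice of normalizing counts so that they sum to 1 are assumptions made for illustration; the text only says the vector is processed by "normalization" without naming a scheme.

```python
def keyword_features(text, keywords):
    """Build a normalized keyword-count feature vector for a text.

    keywords: the manually specified keyword list.
    Normalization to unit sum is an assumption; the source does not name a scheme.
    """
    # Count non-overlapping occurrences of each keyword by plain text matching.
    counts = [text.count(k) for k in keywords]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(keywords)
    return [c / total for c in counts]
```

The resulting vector would then be passed to whichever classifier is chosen for training and prediction.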
Like the mechanical filtering methods, the methods above do not make good use of the characteristics of web pages and cannot yet achieve satisfactory results. For example, text-based sensitive web page recognition cannot reliably recognize normal web pages that relate to sensitive topics, and image-based sensitive web page recognition has a very high false-recognition rate. Existing fusion algorithms merely combine results with AND/OR operations and cannot fundamentally improve the recognition rate.
Summary of the invention
Text-based sensitive web page recognition in the prior art cannot reliably recognize normal web pages related to sensitive topics, image-based sensitive web page recognition has a high false-recognition rate, and the fusion algorithms used merge results only with AND/OR operations and cannot fundamentally improve the recognition rate. To solve these problems of the prior art, the object of the present invention is to provide, starting from the characteristics of web pages, a method and system for filtering sensitive web pages based on multi-classifier fusion.
To achieve this object, one aspect of the present invention provides a method for filtering sensitive web pages based on multi-classifier fusion, comprising the following steps:
Step S1: obtain the source code of the target web page from its uniform resource locator (URL) and preprocess it to obtain the Chinese text information and the effective image set of the web page;
Step S2: based on the information provided by preprocessing, use the C4.5 decision tree learning algorithm to assign the input web page, from its Chinese text and effective images, to a text, image, or mixed text-and-image web page mode, so as to obtain the text stream, image stream, or mixed text-and-image stream information;
Step S3: recognize the target web page using the multiple classifiers, according to the assignment relation between web page modes and classifiers;
Step S4: judge comprehensively from the recognition results whether the target web page is sensitive; if it is sensitive, execute step S5; if not, execute step S6;
Step S5: send the recognized sensitive web page to the Web browser, warn the user in the browser that the page being browsed contains sensitive content, and prohibit browsing;
Step S6: display the original web page normally in the Web browser.
The classifier recognition comprises: using a continuous sensitive text classifier to recognize text-based web page modes; using a sensitive image classifier to recognize the image set of image-based web page modes; and using the fusion of a discrete sensitive text classifier and the sensitive image classifier to recognize mixed-type web page modes.
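The assignment of classifiers to web page modes in steps S3 and S4 can be sketched as a simple dispatch. This is a minimal illustration, not the patented implementation: the function `filter_page`, the mode strings, and the callables `text_clf`, `image_clf`, and `fuse` are all hypothetical stand-ins supplied by the caller.

```python
def filter_page(mode, text, images, text_clf, image_clf, fuse, image_threshold=0):
    """Return True if the page is judged sensitive.

    All three callables are hypothetical stand-ins:
      text_clf(text) -> bool          continuous sensitive text classifier
      image_clf(image) -> bool        sensitive image classifier
      fuse(n_sens, n_norm, text) -> bool   discrete text + image fusion
    """
    if mode == "text":
        return text_clf(text)
    # Both remaining modes need per-image decisions.
    flags = [bool(image_clf(img)) for img in images]
    n_sensitive = sum(flags)
    n_normal = len(flags) - n_sensitive
    if mode == "image":
        return n_sensitive > image_threshold
    if mode == "mixed":
        return fuse(n_sensitive, n_normal, text)
    raise ValueError(f"unknown page mode: {mode!r}")
```

A caller would then block the page (step S5) when the function returns True and display it normally (step S6) otherwise.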
The step of obtaining the effective images in the web page comprises:
Step 11: at the preprocessing stage, analyze the HyperText Markup Language (HTML) code of the web page to obtain the size and position information of every image the page contains, for use in recognizing the overall content of the target web page;
Step 12: if the size and position information satisfy rules derived in advance from statistics, add the image to the effective image set.
The step in which the C4.5 decision tree algorithm assigns the input web page to a web page mode comprises:
Step 21: for the attribute set, comprising the web page URL, the length of the text in the web page, and the classification of the images in the web page by pixel count, compute the information entropy and the gain in information entropy before and after classification;
Step 22: use the information entropy gain as the classification criterion, that is, take the attribute with the maximum information entropy gain as the final decision for the split;
Step 23: repeat step 22 until all attributes have been used for splitting, thereby forming the decision tree and the classification rules.
The step of recognizing text-based web pages with the continuous sensitive text classifier comprises:
Step 1): use a cellular neural network (CNN) to define a massively parallel computing network on an N-dimensional discrete space, treat each node of the network as a keyword, and describe the connections between nodes so as to represent the semantic relations between the words of the text;
Step 2): using the semantic relations between words in the text, let nodes inhibit and activate one another, and take the number of activations of each node as the statistical feature of the text;
Step 3): with the statistical features as input, use a support vector machine (SVM) as the classifier for training and prediction, and classify the text obtained from the preprocessed web page to obtain the classification result.
The recognition of the text of mixed-type web pages with the discrete sensitive text classifier comprises:
First, the vector space model (VSM) is used to extract the features of the discrete sensitive text. The discrete sensitive text features are input to a trained Bayesian network (Bayes Networks, abbreviated BNS), whose output is the probability that the input text is sensitive; if this probability is greater than a threshold, the text is classified as sensitive.
The information fusion step for the image recognition and text recognition of mixed-type web pages comprises:
First, the image classifier is used to recognize every image of the mixed-type web page, giving the number $N_1$ of images recognized as sensitive and the number $N_2$ of images recognized as normal.
The result of discrete text recognition is then fused with the image recognition result above; if the fused result is greater than a threshold, the web page is sensitive; otherwise it is a normal web page.
To achieve this object, another aspect of the present invention provides a system for filtering sensitive web pages based on multi-classifier fusion, comprising: a data stream acquisition and preprocessing unit, which generates the text stream and image stream of the original web page and on this basis assigns the page to a web page mode; an image and text stream filtering unit, which applies the corresponding classifier to recognize the text and images for each web page mode; and an information fusion unit for the image filter and text filter, which, for the mixed-type web page mode, combines the image filter and the text filter by fusion to obtain the final result on whether the page is sensitive.
The present invention uses the browser core control based on the IE kernel provided by Microsoft to distribute and transmit data, uses the cooperation of multiple classifiers to perform intelligent recognition, and uses web navigation technology to exchange data between the filter and the browser, thereby solving the problem of strictly controlling access to sensitive information on the network. The system is fast: the processing time for a single web page is less than 10 seconds, and the accuracy of the results exceeds 80%. The invention therefore has good application prospects in the field of network information security.
Description of drawings
Fig. 1 shows the assignment relation between the three web page modes and the classifiers.
Fig. 2(a) shows the size distribution of effective and invalid images in the GIF training set.
Fig. 2(b) shows the distribution of effective and invalid images in the JPG training set.
Fig. 3 is an overall block diagram of the multi-classifier sensitive web page recognition method of the present invention.
Fig. 4 is a block diagram of the multi-classifier sensitive web page recognition system of the present invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to facilitate understanding of the invention and do not limit it in any way.
Fig. 4 shows the system of the present invention for filtering sensitive web pages based on multi-classifier fusion. It comprises: a data stream acquisition and preprocessing unit 1, which generates the text stream and image stream of the original web page and on this basis assigns the page to a web page mode; an image and text stream filtering unit 2, which applies the corresponding classifier to recognize text and images for each web page mode; and an information fusion unit 3 for the image filter and text filter, which, for the mixed-type web page mode, combines the image filter and text filter through a fusion formula to obtain the final result on whether the page is sensitive. In summary, unit 1 parses the web page to obtain the text and image streams and uses the C4.5 algorithm to assign the page to a web page mode; unit 2, according to the mode assigned by unit 1, applies the corresponding classifier to the text and image streams parsed by unit 1; and unit 3, for mixed-type web pages processed by unit 2, substitutes the text and image classification results produced by unit 2 into the fusion formula to obtain the combined recognition result. Recognition then ends.
The invention is implemented on the Microsoft Windows XP platform, in the VC6.0 and VC.Net programming environments, as a Microsoft Internet Explorer plug-in, and experiments show that it runs correctly on PCs and computer terminals.
In the method of the invention, web pages are divided into three classes on the basis of an analysis of the Web. Fig. 1 shows the assignment relation between the three web page modes and the classifiers. The first class is text-based web pages, whose text is mostly article-like, such as novels, news, and biographies; it is characterized by strong semantic association between contexts, so rich semantic information is available. Pages of this type usually contain one or several articles. The second class is image-based web pages, which mainly present image information accompanied by a small amount of scattered text that serves as auxiliary explanation; pages of this type are mainly presented as picture galleries. The third class, which is also the most common web page mode, is pages that mix text and images. Their text also appears in scattered blocks and mainly serves as links or explanation, and the page also contains multiple images that enrich its content. Pages of this mode mainly include the home pages of well-known portal sites and bulletin board systems (BBS).
Based on the information provided by preprocessing, the web page URL, the length of the text in the page, the classification of the images in the page by pixel count, and so on are used as the attribute set, and the C4.5 decision tree learning algorithm assigns the input web page to one of the three modes defined above. A divide-and-conquer strategy is then applied, using the corresponding classifier for each of the three types of web page.
For web pages of the first mode (text-based), a cellular neural network (CNN) is used. The biggest difference between a CNN and other neural networks is that information is exchanged only between adjacent cells, and global information is processed through local interactions. A cellular neural network can have any dimension, but one and two dimensions are most common. In a one-dimensional cellular neural network, the most common connection scheme links each cell to the 2r+1 cells around it (including itself). In a two-dimensional network, the most common schemes are the Von Neumann and Moore connections, in which each cell is linked to the cells in its Von Neumann or Moore neighborhood. The state of a cell is described formally as:
$$x(t+1) = g(x(t)) + I(t) + f_1(y(t)) + f_2(u(t))$$
$$y(t) = f(x(t))$$
where $x$ is the internal state of the cell, $y$ is its output, $u$ is the external input, $I$ is the bias, and $f_1$ and $f_2$ are two functions.
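To make the state equations concrete, here is a minimal sketch of one synchronous update of a one-dimensional cellular network in which each cell sees only the 2r+1 cells around it. The source does not specify $g$, $f$, $f_1$, or $f_2$, so every concrete function and coefficient below (the `tanh` output, the 0.9 decay, the 0.5 and 1.0 weights) is an assumption chosen only for illustration.

```python
import math

def step(x, u, bias, r=1):
    """One synchronous update x(t) -> x(t+1) of a 1-D cellular network.

    x: list of cell states, u: list of external inputs, bias: I(t).
    All concrete function choices are assumptions; the source leaves them open.
    """
    n = len(x)
    y = [math.tanh(v) for v in x]                  # y(t) = f(x(t)), f assumed tanh
    nxt = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)  # the 2r+1 neighbors, clipped at edges
        f1 = 0.5 * sum(y[lo:hi])                   # f_1: assumed feedback from neighbor outputs
        f2 = 1.0 * sum(u[lo:hi])                   # f_2: assumed feed-forward of neighbor inputs
        g = 0.9 * x[i]                             # g: assumed linear decay of the cell's own state
        nxt.append(g + bias + f1 + f2)
    return nxt
```

With `r=1`, an input applied to the first cell reaches only that cell and its immediate neighbor in one step, which is the local-interaction property the text describes.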
To build the sensitive vocabulary network, traditional keywords are first divided into three classes:
(1) explicit keywords; (2) implicit keywords; (3) logic keywords.
Explicit keywords determine logic keywords, and there are also intrinsic links between explicit keywords and implicit keywords. Using the relations among the three, an associative feedback network can be constructed.
To use the cellular neural network, each node is defined as a word, and each word has three states: quiet, hidden, and excited. Nodes are connected to one another according to semantic association. The computation rule is: once a node receives a stimulus or input, its next state is determined by its previous state, the states of the surrounding nodes, and the semantic rules that the connections represent.
The quiet state is defined as the state of a node that has not yet received any input. The hidden state is defined as the state of a node that has received input but whose parameters, together with those of the surrounding nodes, do not meet its excitation condition. The excited state is defined as the state of a node that has received input and has been excited. Once a node is excited, the number of times it occurs is counted, and finally the excitation degrees of all nodes are used as a vector for training and prediction. A support vector machine (SVM) is used as the classifier to train on and classify the features formed by this vector, and whether the web page is sensitive is decided according to the SVM output.
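The three-state node model can be sketched as follows. The source does not give the excitation condition, so the rule used here is a deliberately simple assumption: an explicit keyword is excited whenever it receives input, and any other keyword is excited only if a semantically linked node is already excited; otherwise it becomes hidden. The names `activation_counts`, `explicit`, and `links` are hypothetical.

```python
QUIET, HIDDEN, EXCITED = 0, 1, 2

def activation_counts(tokens, explicit, links):
    """Count how often each keyword node is excited over a token stream.

    explicit: set of keywords that excite on input (assumed rule).
    links: dict mapping a keyword to the set of keywords it is linked to
           (assumed rule: input plus an excited linked node causes excitation).
    """
    vocab = set(explicit) | set(links) | {n for ns in links.values() for n in ns}
    state = {w: QUIET for w in vocab}
    counts = {w: 0 for w in vocab}
    for tok in tokens:
        if tok not in vocab:
            continue  # not a keyword node; no input is delivered
        neighbor_excited = any(state[n] == EXCITED for n in links.get(tok, ()))
        if tok in explicit or neighbor_excited:
            state[tok] = EXCITED
            counts[tok] += 1
        elif state[tok] == QUIET:
            state[tok] = HIDDEN  # input received, excitation condition not met
    return counts
```

The counts, once ordered by keyword and normalized, would form the feature vector handed to the SVM; the SVM itself is omitted here to keep the sketch dependency-free.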
For web pages of the second mode (image-based), the effective image set of the page is extracted and the image classifier is applied to the set; if the number of images judged sensitive exceeds a predetermined threshold, the web page is judged sensitive.
For web pages of the third mode (mixed), the effective image set of the page is first obtained according to size, and the image classifier then recognizes the images one by one. The recognition result is $(N_1, N_2)$, where $N_1$ is the number of images recognized as sensitive and $N_2$ is the number recognized as normal. At the same time, as prior knowledge of whether the images are sensitive, a Bayes classifier for discrete text is applied to the text of the page, and its output is $P_s$. The three classifier outputs $N_1$, $N_2$, and $P_s$ are then substituted into the fusion formula to obtain a discrimination factor $f$, which is compared with a predetermined threshold to judge whether the page is sensitive.
As shown in Fig. 3, the overall flow of the multi-classifier sensitive web page recognition method of the present invention comprises the following steps:
Step 1): obtain the source code of the given target web page from its uniform resource locator (URL) and separate out the Chinese text in the source code.
The source parsing program is based on the relevant W3C documentation on HTML and XML and is then improved to address the difficult points of parsing. Strictly speaking, an HTML document is a complete tree structure, but the standard's loose rules for some tags mean that real documents may not follow a strict hierarchy. The HyperText Markup Language (HTML) source code of the target web page is obtained first, and the HTML document is then parsed. Parsing is divided into three sub-steps:
(1) element analysis of the document, which generates a node sequence;
(2) structural/syntactic analysis of the element sequence, which generates an initial HTML tree;
(3) reconstruction of the HTML tree. The text content enclosed between the various tags in the resulting HTML tree is separated out as the Chinese text of the source code stream.
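Separating text from markup can be illustrated with Python's standard `html.parser` module. This is a stand-in, not the three-sub-step tree parser described above: it is event-based, builds no tree, and simply tolerates unclosed tags. The class name `TextExtractor` and the decision to skip `script` and `style` content are assumptions.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text found between tags, ignoring script and style content."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Because the parser is lenient, a document with a missing closing tag still yields its text, which matches the observation that real pages do not follow a strict hierarchy.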
Step 2): obtain the size and position information of the images in the source code, discard some of the images according to the relevant rules, and obtain the effective image set.
Processing images is very expensive; if most of the pictures in a web page are invalid, system performance suffers greatly. Picture dimensions are considered first, because the HTML standard supports specifying the size of a picture when it is included in a page. Invalid pictures can therefore be ignored on the basis of the HTML file alone, without downloading them. This also reduces network overhead: in general, downloading a picture from the network takes longer than analyzing it.
Web pages usually contain a considerable number of images. In general, a richly illustrated web page may contain dozens or even hundreds of images, although by a viewer's subjective estimate the number is around ten or so, however many pictures the page contains. The large gap between the actual count and the subjective impression arises because many pictures serve purely as decoration for the page frame, and others fail to attract attention because they carry too little information or because of their position in the page. What actually needs to be recognized is the effective image set, which is used to recognize the overall content of the target page. Effectiveness shows in two respects: picture size and picture position. In Fig. 2, the horizontal and vertical axes are the width and height of the image, given as numerical values. In this representation the clustering of effective images can be seen clearly, and the classification strategy is built on this feature. The position at which an image appears is another important indicator; the influence of web page structure on page elements has been discussed above. Accordingly, a picture at the core of a page should be more effective than one in a corner. Finally, the effective image set of the page is extracted according to the rules above and serves as the image stream.
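Filtering images by the size declared in the HTML, without downloading them, might look like the sketch below. The thresholds `min_side=100` and `max_aspect=3.0` are hypothetical placeholders; the source derives its rules from the training-set statistics in Fig. 2 and does not state numeric values. Position rules are omitted because the source gives no concrete criterion.

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect (src, width, height) for img tags that declare numeric sizes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        try:
            w, h = int(a.get("width", "")), int(a.get("height", ""))
        except (TypeError, ValueError):
            return  # size missing or not a plain number; skipped in this sketch
        self.images.append((a.get("src"), w, h))

def effective_images(html, min_side=100, max_aspect=3.0):
    """Return the src of images passing the (hypothetical) size rules."""
    collector = ImgCollector()
    collector.feed(html)
    return [src for src, w, h in collector.images
            if min(w, h) >= min_side and max(w, h) / min(w, h) <= max_aspect]
```

Small icons and banner-shaped images are dropped by these rules, so only the remaining images would need to be fetched and passed to the image classifier.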
Step 3): the Chinese text and the effective image set extracted from the web page in steps 1) and 2) form an attribute set; this attribute set is substituted into the learning formulas of the C4.5 decision tree algorithm to obtain the decision rules. Afterwards, when the text and image attributes of a target web page are classified against these decision rules, the page is automatically assigned to one of three modes: text-based, image-based, or mixed. The decision rules of the C4.5 algorithm are formed as follows:
Let $C$ be the number of classes (3 in our system), and let $p(D, j)$ be the proportion of the data in data set $D$ that belongs to class $j$. The information entropy $\mathrm{Info}(D)$ is then defined as:
$$\mathrm{Info}(D) = -\sum_{j=1}^{C} p(D, j) \cdot \log_2 p(D, j) \tag{1}$$
Given an attribute $T$ with $k$ values, let $D_i$ denote the subset of $D$ formed by the data whose value on attribute $T$ is $i$. The information gain produced on attribute $T$ and data set $D$ by the different values of $T$ is then defined as:
$$\mathrm{Gain}(D, T) = \mathrm{Info}(D) - \sum_{i=1}^{k} \frac{|D_i|}{|D|} \cdot \mathrm{Info}(D_i) \tag{2}$$
According to the information gain, the C4.5 algorithm chooses at each step the attribute with the maximum information gain as the split node, thereby forming the decision tree (decision rules); later classification simply follows the rules so formed.
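Equations (1) and (2) translate directly into code. The sketch below computes the entropy and the information gain for discrete attribute values and picks the best split attribute; the function names and the dictionary-based row format are assumptions. Note that it follows the formulas as written in this text, which use plain information gain, whereas C4.5 as commonly described selects splits by gain ratio.

```python
from collections import Counter
from math import log2

def info(labels):
    """Info(D) from Eq. (1): entropy of the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Gain(D, T) from Eq. (2) for the discrete attribute named attr."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)  # D_i for each value i
    return info(labels) - sum(len(s) / n * info(s) for s in subsets.values())

def best_attribute(rows, labels, attrs):
    """Choose the attribute with the maximum information gain as the split node."""
    return max(attrs, key=lambda a: gain(rows, labels, a))
```

Applying `best_attribute` recursively to each resulting subset, until the attributes are exhausted or a subset is pure, would yield the decision tree.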
The web page attribute set used in the present invention is shown in the table below: the web page URL, the length of the text in the page, and the classification of the images in the page by pixel count.
Attribute | Description
Homepage character | Whether the URL of the web page contains a keyword indicating a homepage (for example "main" or "index")
Length of general text | Number of characters of general text in the web page
Length of hypertext | Number of characters of hypertext in the web page
Number of large images | Number of images with more than 50,000 pixels
Number of medium images | Number of images with between 10,000 and 50,000 pixels
Number of small images | Number of images with fewer than 10,000 pixels
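The attribute set in the table can be computed from the preprocessing output as in the sketch below. The function name, the argument layout, and the dictionary keys are hypothetical, and counting images of exactly 10,000 or 50,000 pixels as medium is an assumption, since the table does not say how the boundaries are handled.

```python
def page_attributes(url, plain_text, link_text, image_sizes):
    """Build the six-attribute description of a page.

    image_sizes: list of (width, height) pairs for the page's images.
    Boundary values (exactly 10,000 or 50,000 pixels) are treated as medium,
    which is an assumption.
    """
    pixels = [w * h for w, h in image_sizes]
    return {
        "is_homepage": any(k in url.lower() for k in ("main", "index")),
        "plain_text_len": len(plain_text),
        "hypertext_len": len(link_text),
        "large_images": sum(p > 50_000 for p in pixels),
        "medium_images": sum(10_000 <= p <= 50_000 for p in pixels),
        "small_images": sum(p < 10_000 for p in pixels),
    }
```

For continuous attributes such as the text lengths, a decision tree learner would additionally need split thresholds, which are not shown here.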
Step 4): the continuous sensitive text classifier is used to recognize the text of web pages classified as text-based in step 3); if the recognition result is 1, the web page is sensitive and the process exits.
The keywords are divided into three classes, defined descriptively as follows. The first class is explicit keywords. A keyword of this class essentially appears only in sensitive text; statistically, its probability of appearing in sensitive text is very high (close to 1) and its probability of appearing in normal text is very low (close to 0). Semantically, these words themselves carry sensitive information. The second class is implicit keywords. A keyword of this class does not itself carry any sensitive information, but for some reason it has formed a fixed association with sensitive text; that is, these words also occur with high probability in sensitive text, although they can of course occur in other text as well. The third class is logic keywords, which fall into two kinds. One kind is polysemous words, whose meaning is normal in normal text but which carry sensitive information in sensitive text. The other kind carries sensitive information only jointly, after being combined with certain other words. Such combinations are of two types: explicit-plus-logic and logic-plus-logic. Based on these definitions, a keyword set is chosen and semantic rules are built to describe the semantic associations between words, which helps to extract feature information correctly. The extracted features, after normalization, serve as the feature vector of the continuous text. A support vector machine (SVM) is used as the classifier to train on and classify the features, and whether the web page is sensitive is decided according to the SVM output.
Step 5): the sensitive image classifier is used to recognize the effective image set of web pages classified as image-based in step 3). The number of images that the image classifier judges sensitive is compared with a predetermined threshold to decide whether the page is sensitive: if the number exceeds the threshold, the page is judged sensitive.
Step 6): the fusion algorithm of the discrete text classifier and the sensitive image classifier is used to perform fusion recognition on web pages classified as mixed in step 3) (that is, pages that contain many images as well as some text). A keyword list is first constructed manually; the keywords in the text of the page are counted, and the normalized counts, as the feature vector of the discrete sensitive text, are input to the trained Bayes network. The discrete text classifier thus recognizes the Chinese text and yields the discrete text classification factor. The algorithm is described as follows:
First define $T = \{t_1, t_2, \ldots, t_{|T|}\}$ as the training set of class $C_j$; $C = \{c_1, c_2, \ldots, c_{|C|}\}$ as the set of classes; and $W = \{w_1, w_2, \ldots, w_{|V|}\}$ as the keyword set. In addition, define $N(w, t_i)$ as the number of times keyword $w$ occurs in document $t_i$, that is, the term frequency of $w$.
Then compute the probability $P(w \mid C_j)$, which represents the degree of association between keyword $w$ and class $C_j$:
$$P(w \mid C_j) = \frac{1 + \sum_{i=1}^{|T|} N(w, t_i)}{|W| + \sum_{s=1}^{|V|} \sum_{i=1}^{|T|} N(w_s, t_i)} \tag{3}$$
When processing a target text $t_i$, compute the probability $P(C_j \mid t_i)$ as the discrete text classifier factor. This probability represents how likely the target text $t_i$ is to belong to class $C_j$, and it relies on the probability $P(w \mid C_j)$ above. A Bayes independence assumption is used here: $P(w_1, w_2, \ldots, w_n \mid C_j) = \prod_i P(w_i \mid C_j)$. That is, it implicitly expresses that the semantic relations between the text keywords of a third-class (mixed-type) web page are not tight, so the keywords can be regarded as occurring independently and in a scattered way.
$$P(C_j \mid t_i) = \frac{P(C_j) \prod_{k=1}^{|V|} P(w_k \mid C_j)^{N(w_k, t_i)}}{\sum_{r=1}^{|C|} P(C_r) \prod_{k=1}^{|V|} P(w_k \mid C_r)^{N(w_k, t_i)}} \tag{4}$$
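Equations (3) and (4) describe a multinomial naive Bayes classifier with add-one smoothing, which can be sketched as below. The function names and the representation of a document as a `{keyword: count}` dictionary are assumptions, and the posterior is computed in log space, which is a numerical convenience not mentioned in the source.

```python
from math import exp, log

def train(docs_by_class, vocab):
    """Estimate P(w | C_j) per Eq. (3).

    docs_by_class: class name -> list of documents, each a {keyword: count} dict.
    vocab: the keyword set W.
    """
    cond = {}
    for c, docs in docs_by_class.items():
        totals = {w: sum(d.get(w, 0) for d in docs) for w in vocab}
        denom = len(vocab) + sum(totals.values())   # |W| + total keyword occurrences
        cond[c] = {w: (1 + totals[w]) / denom for w in vocab}
    return cond

def posterior(doc, cond, prior):
    """Compute P(C_j | t_i) per Eq. (4), using logs to avoid underflow."""
    logs = {
        c: log(prior[c]) + sum(n * log(cond[c][w]) for w, n in doc.items() if w in cond[c])
        for c in cond
    }
    m = max(logs.values())
    z = sum(exp(v - m) for v in logs.values())      # normalizer over all classes
    return {c: exp(v - m) / z for c, v in logs.items()}
```

The value `posterior(doc, cond, prior)["sensitive"]` would then play the role of the discrete text classifier factor used in the fusion step, assuming the sensitive class is given that label.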
For a web page of the third type, the images in the page that meet the size requirements are extracted, and the image classifier then recognizes them one by one. The recognition result is $(N_1, N_2)$, where $N_1$ is the number of images recognized as sensitive and $N_2$ is the number of images recognized as normal. At the same time, as prior knowledge on whether the images are sensitive, the Bayes classifier for discrete text is applied to the text in the web page; the resulting discrete text classifier factor described above is denoted $P_s$. Two parameters describe the image classifier: $p_1$ is the probability that a normal image is misclassified as sensitive, and $p_2$ is the probability that a sensitive image is misclassified as normal. The three parameters are substituted into the following formula:
$$f = \frac{(1 - p_2)^{N_1} \, p_2^{N_2}}{p_1^{N_1} \, (1 - p_1)^{N_2}} \cdot \frac{P_s}{1 - P_s} \tag{5}$$
This yields a discriminant factor $f$, which is compared with a predetermined threshold to judge whether the web page is sensitive.
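One way to read formula (5) is as a likelihood ratio of the image results (all images sensitive versus all images normal) multiplied by the text prior odds $P_s / (1 - P_s)$. A minimal sketch under that reading follows; the function names and the default threshold of 1.0 are assumptions, since the text only states that a predetermined threshold is used.

```python
import math


def discriminant_factor(n1, n2, p1, p2, ps):
    """Compute the discriminant factor f of formula (5).

    n1: number of images recognized as sensitive (N_1)
    n2: number of images recognized as normal (N_2)
    p1: probability that a normal image is misclassified as sensitive
    p2: probability that a sensitive image is misclassified as normal
    ps: discrete text classifier factor P_s
    p1, p2 and ps must lie strictly between 0 and 1.
    """
    log_f = (
        n1 * math.log(1 - p2) + n2 * math.log(p2)    # numerator of the image term
        - n1 * math.log(p1) - n2 * math.log(1 - p1)  # denominator of the image term
        + math.log(ps) - math.log(1 - ps)            # text prior odds
    )
    return math.exp(log_f)


def is_sensitive_page(n1, n2, p1, p2, ps, threshold=1.0):
    # threshold=1.0 is an illustrative value, not one given in the source.
    return discriminant_factor(n1, n2, p1, p2, ps) > threshold
```

With no valid images ($N_1 = N_2 = 0$) the factor reduces to the text odds alone, so the text classifier decides the outcome by itself.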
Step 7): return the final sensitivity decision to the Web browser. If the result is sensitive, display of the web page is blocked on the client; if the result is non-sensitive, the page is displayed normally.
The above is only an embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any variation or substitution that a person familiar with this technology could conceive within the technical scope disclosed by the invention shall be covered by the invention. Therefore, the scope of protection of the invention shall be determined by the scope of protection of the claims.

Claims (8)

1. A sensitive web page filtering method based on multiple-classifier fusion, comprising the steps of:
Step S1: obtain the source code of the target web page from its uniform resource locator (URL) and preprocess it, so as to obtain the text information in the web page and the set of valid images in the web page;
Step S2: based on the information provided by preprocessing, use the C4.5 decision tree learning algorithm to assign the input web page, according to its text and valid images, to a text, image, or mixed text-and-image web page mode, so as to obtain the text stream, the image stream, or the mixed text-and-image stream information;
Step S3: according to the assignment relation between web page modes and classifiers, use the multiple classifiers to recognize the target web page;
Step S4: comprehensively judge from the recognition results whether the target web page is sensitive; if it is sensitive, execute step S5; if it is not sensitive, execute step S6;
Step S5: send the identified sensitive web page to the Web browser, warn the user in the browser that the page being browsed contains sensitive content, and prohibit browsing;
Step S6: display the original web page normally in the Web browser.
2. The method of claim 1, wherein the classifier recognition comprises: using a continuous sensitive text classifier to recognize text-based web page modes; using a sensitive image classifier to recognize the image set in image-based web page modes; and using the fusion of a discrete sensitive text classifier and the sensitive image classifier to recognize mixed-type web page modes.
3. The method of claim 1, wherein the step of obtaining the valid images in the web page comprises:
Step 11: in the preprocessing stage, analyze the HyperText Markup Language (HTML) code of the web page to obtain the size and position information of every image the page contains, for use in recognizing the overall content of the target web page;
Step 12: if the size information and position information satisfy rules established beforehand from statistics, assign the image to the valid image set.
4. The method of claim 1, wherein the step in which the C4.5 decision tree algorithm generates the web page mode for an input web page comprises:
Step 21: for the attributes in the attribute set, namely the web page URL, the length of the text in the web page, and the classification of the images in the web page by pixel count, compute the information entropy and the change (gain) in information entropy before and after classification;
Step 22: use the information entropy gain as the classification criterion, that is, take the attribute with the largest information entropy gain as the final split decision;
Step 23: repeat step 22 until all attributes have been split, thereby forming the decision tree and the classification rules.
5. The method of claim 2, wherein the step of using the continuous sensitive text classifier to recognize text-based web pages comprises:
Step 1): use a cellular neural network to define a large-scale parallel computing network on an N-dimensional discrete space, treat each node of the network as a keyword, and describe the connections between nodes, so as to generate the semantic relations between the words of the text;
Step 2): using the semantic relations between the words of the text, let the nodes inhibit and activate one another, and take the activation counts of the nodes as the statistical features of the text;
Step 3): with the statistical features as input, use a support vector machine as the classifier for training and prediction, classify the text obtained from the preprocessed web page, and obtain the classification result.
6. The method of claim 2, wherein using the discrete sensitive text classifier to recognize the text in a mixed-type web page comprises:
first, using a vector space model to extract the features of the discrete sensitive text;
then, inputting the discrete sensitive text features into a trained Bayesian network, whose output is the probability that the input text is sensitive; if this probability is greater than a threshold, the text is classified as sensitive.
7. The method of claim 1, wherein the step of fusing the image recognition and text recognition information for a mixed-type web page comprises:
first, using the image classifier to recognize every image in the mixed-type web page, obtaining the number $N_1$ of images recognized as sensitive and the number $N_2$ of images recognized as normal;
then, fusing the result of discrete text recognition with the above image recognition result; if the fused result is greater than a threshold, the web page is sensitive; otherwise it is a normal web page.
8. A sensitive web page filtering system based on multiple-classifier fusion, characterized by comprising:
a data stream acquisition and preprocessing unit (1), which generates the text stream and image stream of the original web page and, on this basis, assigns the original web page to a web page mode;
an image and text stream filtering unit (2), which, for each web page mode, uses the corresponding classifier to recognize the text and images;
an information fusion unit (3) for the image filter and text filter, which, for the mixed-type web page mode, fuses the image filter and the text filter to obtain the final recognition result of whether the page is sensitive.
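The attribute selection described in claim 4 (steps 21 to 23) can be sketched as follows. This is a minimal illustration that follows the claim's wording and ranks attributes by plain information gain; C4.5 as usually described additionally normalizes the gain by the split information (gain ratio), which is omitted here. The attribute names and values in the usage note are hypothetical, and attributes are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after splitting on attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after


def best_attribute(rows, labels, attributes):
    """Step 22: pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

For example, with rows such as `{"text_length": "long", "url_kind": "a"}` labelled with a page mode (`"text"` or `"image"`), `best_attribute(rows, labels, ["text_length", "url_kind"])` returns the attribute that separates the modes best; repeating the selection on each resulting subset builds the tree, as in step 23.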
CNB2007100651816A 2007-04-05 2007-04-05 A kind of filtering sensitive web page method and system based on multiple Classifiers Combination Active CN100565523C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007100651816A CN100565523C (en) 2007-04-05 2007-04-05 A kind of filtering sensitive web page method and system based on multiple Classifiers Combination

Publications (2)

Publication Number Publication Date
CN101281521A true CN101281521A (en) 2008-10-08
CN100565523C CN100565523C (en) 2009-12-02

Family

ID=40013998

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007100651816A Active CN100565523C (en) 2007-04-05 2007-04-05 A kind of filtering sensitive web page method and system based on multiple Classifiers Combination

Country Status (1)

Country Link
CN (1) CN100565523C (en)

Also Published As

Publication number Publication date
CN100565523C (en) 2009-12-02

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant