To remedy the deficiencies of the prior art, the object of the invention is to identify sensitive information on the basis of web page characteristics and to further improve the recognition rate for sensitive web pages. To this end, the present invention proposes a content-based sensitive web page identification method.
To achieve this object, the content-based sensitive web page identification method of the present invention comprises a preprocessing step and a sensitive information identification step.
The preprocessing step comprises:
Given the uniform resource locator (URL) of a web page, obtaining the source code of the web page, and separating and preprocessing the data to obtain text information;
Obtaining structural information about the images in the web page, and selecting the important images to form an effective image set.
The sensitive information identification step comprises:
Recognizing the text information with a continuous sensitive-text recognizer;
Recognizing the text information with a discrete text recognizer;
Recognizing the images in the image set with a sensitive image recognizer.
The sensitive information identification step proceeds as follows:
The continuous sensitive-text recognizer recognizes the text information; if the recognition result is sensitive, processing ends; if the recognition result is not sensitive, the following is performed:
The discrete text recognizer recognizes the text information; if the output of the recognizer is greater than a threshold, the recognition result is sensitive and processing ends; otherwise, the following is performed:
The sensitive image recognizer recognizes the images in the image set, its result is fused with the result of the discrete sensitive-text recognizer, and whether the web page is sensitive is judged from the fusion result.
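The cascade described above can be sketched as follows. This is a minimal illustration only: the function name `identify_page`, the callable parameters, and the default threshold value are hypothetical, and the three recognizers and the fusion rule are passed in as placeholders rather than implemented.

```python
from typing import Callable, Sequence


def identify_page(
    text: str,
    images: Sequence[bytes],
    continuous_recognizer: Callable[[str], int],
    discrete_recognizer: Callable[[str], float],
    image_recognizer: Callable[[bytes], bool],
    fuse: Callable[[float, int, int], bool],
    tau: float = 0.9,  # hypothetical threshold; the description does not give a value
) -> bool:
    """Return True if the page is judged sensitive, following the three-stage cascade."""
    # Stage 1: continuous sensitive-text recognizer; an output of 1 ends processing.
    if continuous_recognizer(text) == 1:
        return True

    # Stage 2: discrete text recognizer; an output above the threshold ends processing.
    p_s = discrete_recognizer(text)
    if p_s > tau:
        return True

    # Stage 3: recognize every image, then fuse the counts with the discrete text result.
    n1 = sum(1 for image in images if image_recognizer(image))  # recognized as sensitive
    n2 = len(images) - n1                                       # recognized as normal
    return fuse(p_s, n1, n2)
```

Because each stage returns as soon as it reaches a sensitive verdict, the later and presumably costlier recognizers run only on pages the earlier ones did not flag.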
The present invention addresses two shortcomings of the prior art: text-based sensitive web page identification cannot reliably recognize normal web pages that are related to sensitive topics, and image-based sensitive web page identification fuses results by AND or OR operations, which cannot fundamentally improve the recognition rate. The present invention solves these problems by combining three recognizers: a continuous sensitive-text recognizer, a discrete text recognizer, and a sensitive image recognizer. It uses web structure information to formulate an image-set recognition problem and performs information fusion, thereby improving the recognition rate for sensitive web pages.
Embodiment
The present invention is described in detail below with reference to the accompanying drawing. It should be noted that the described embodiments are for illustrative purposes only and do not limit the present invention.
Fig. 1 is a schematic diagram of the system framework of the present invention. The specific steps are as follows:
Step S1: obtain the source code of the web page at the given URL;
Step S2: separate the Chinese text from the source code;
Step S3: obtain the size information of the images from the source code, and discard some of the images according to rules;
Step S4: recognize the separated Chinese text with the continuous text classifier; if the recognition result is 1, the web page is sensitive and processing exits;
Step S5: recognize the Chinese text with the discrete text classifier; if the recognition result is greater than a set threshold, the web page is sensitive and processing exits;
Step S6: recognize the images with the image classifier;
Step S7: fuse the image recognition result with the discrete text recognition result.
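Steps S1 and S2 can be illustrated with the Python standard library. The sketch below is one possible reading, not the separation procedure of the invention: the names `fetch_source` and `extract_chinese_text` are hypothetical, the UTF-8 default encoding is an assumption, and "Chinese text" is approximated as runs of characters in the basic CJK Unified Ideographs block.

```python
import re
from html.parser import HTMLParser
from typing import List
from urllib.request import urlopen

# Assumption: "Chinese text" means runs of characters in U+4E00..U+9FFF.
_CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def fetch_source(url: str, encoding: str = "utf-8") -> str:
    """Step S1: download the page source (the encoding is an assumed default)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


class _TextCollector(HTMLParser):
    """Collects character data, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_chinese_text(source: str) -> List[str]:
    """Step S2: return the runs of Chinese characters found in the visible text."""
    collector = _TextCollector()
    collector.feed(source)
    collector.close()
    runs: List[str] = []
    for chunk in collector.chunks:
        runs.extend(_CJK_RUN.findall(chunk))
    return runs
```

A typical call would be `extract_chinese_text(fetch_source(url))`; real pages often declare other encodings such as GBK, which the caller would need to pass explicitly.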
In step S3, selecting the important images comprises:
Obtaining the size information of each image contained in the web page;
If the size of an image satisfies rules derived in advance from statistics, treating the image as an important image and adding it to the effective image set.
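A size-based selection of this kind might look like the following sketch. The statistical rules themselves are not given in the description, so the constants `MIN_WIDTH`, `MIN_HEIGHT`, and `MAX_ASPECT` are hypothetical placeholders, and the sizes are assumed to be readable from the `width` and `height` attributes of the `img` tags in the source code.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Hypothetical rule parameters; the description only says the rules come from prior statistics.
MIN_WIDTH = 200
MIN_HEIGHT = 200
MAX_ASPECT = 3.0


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse an attribute such as "640" or "640px"; return None if it is not a pixel size."""
    if value is None:
        return None
    try:
        return int(value.strip().rstrip("px"))
    except ValueError:
        return None


class _ImageSizeCollector(HTMLParser):
    """Records (src, width, height) for every <img> tag in the source."""

    def __init__(self) -> None:
        super().__init__()
        self.images: List[Tuple[str, Optional[int], Optional[int]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        self.images.append(
            (attr.get("src") or "", _to_int(attr.get("width")), _to_int(attr.get("height")))
        )


def is_important(width: Optional[int], height: Optional[int]) -> bool:
    """Apply the assumed size rule: large enough and not an extreme strip or banner."""
    if width is None or height is None:
        return False
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return False
    return max(width, height) / min(width, height) <= MAX_ASPECT


def effective_image_set(source: str) -> List[str]:
    """Return the src values of the images that pass the size rule."""
    collector = _ImageSizeCollector()
    collector.feed(source)
    collector.close()
    return [src for src, w, h in collector.images if is_important(w, h)]
```

Images without declared dimensions are discarded in this sketch; an implementation could instead download them and measure the actual size.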
In step S4, recognizing the text with the continuous sensitive-text recognizer comprises:
Extracting the features of the text;
Inputting the text features into a support vector machine (Support Vector Machine, SVM) trained in advance; if the output is 1, the text is sensitive and processing ends; otherwise processing continues.
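The decision made by a trained SVM can be illustrated with a linear decision function. The kernel and the trained parameters are not stated in the description, so the linear form, the function name `svm_decide`, and the convention of returning -1 for a non-sensitive text are all assumptions.

```python
from typing import Sequence


def svm_decide(features: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Return 1 (sensitive) if w.x + b > 0, otherwise -1, for an assumed linear SVM."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1
```

In practice the weights and bias would come from training on labelled continuous texts; a nonlinear kernel would replace the dot product with a sum over support vectors.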
In step S5, recognizing the text with the discrete sensitive-text recognizer comprises:
Extracting the features of the text with a vector space model (VSM);
Inputting the text features into a trained Bayesian network (Bayes Networks, BNs), whose output is the probability that the input text is sensitive; if the probability is greater than a threshold τ, the text is sensitive and processing ends; otherwise processing continues.
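The structure of the Bayesian network is not given, so the sketch below substitutes a naive Bayes model, the simplest Bayesian network in which the class node is the only parent of every keyword node. The keyword names, the likelihood table, and the function names are hypothetical; only the comparison of the output probability with τ follows the description.

```python
import math
from typing import Dict, Iterable, Tuple


def sensitive_probability(
    present_keywords: Iterable[str],
    likelihoods: Dict[str, Tuple[float, float]],
    prior_sensitive: float,
) -> float:
    """Posterior P(sensitive | keyword presence) under a naive Bayes stand-in.

    likelihoods maps keyword -> (P(present | sensitive), P(present | normal));
    all probabilities must lie strictly between 0 and 1.
    """
    present = set(present_keywords)
    log_s = math.log(prior_sensitive)
    log_n = math.log(1.0 - prior_sensitive)
    for keyword, (p_given_s, p_given_n) in likelihoods.items():
        if keyword in present:
            log_s += math.log(p_given_s)
            log_n += math.log(p_given_n)
        else:
            log_s += math.log(1.0 - p_given_s)
            log_n += math.log(1.0 - p_given_n)
    # Normalize in a numerically stable way.
    top = max(log_s, log_n)
    exp_s = math.exp(log_s - top)
    exp_n = math.exp(log_n - top)
    return exp_s / (exp_s + exp_n)


def is_sensitive(probability: float, tau: float) -> bool:
    """Apply the threshold rule: sensitive only if the probability exceeds tau."""
    return probability > tau
```

The returned probability plays the role of the recognizer output that is compared with τ here and reused later in the fusion step.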
In step S6, the image recognition step comprises:
Recognizing each image with the image recognizer, where the number of images recognized as sensitive is N1 and the number of images recognized as normal is N2.
In step S7, the information fusion step comprises:
Fusing the result of discrete text recognition with the image recognition result of step S6 by substituting the recognition results into formula (1-1); if the result is greater than 1, the web page is sensitive; otherwise it is normal, and processing ends.
In steps S1 and S2 of the method, web pages are divided into three classes on the basis of an analysis of the web. The first class is web pages based on continuous text, where continuous text is defined as text in the nature of an article; its characteristic is that there is strong semantic association within the context, so rich semantic information is available. Web pages of this type usually contain one or several articles. The second class is web pages based on discrete text, where discrete text refers to all text other than continuous text, for example the text of a home page or the explanatory text around pictures, which mainly serves as links or captions. The third class is web pages based on images: the page mainly presents image information, together with a small amount of discrete text.
Specifically, for web pages of the first class, in which continuous text predominates, the present invention uses a filtering method that combines semantics and statistics, and defines three classes of keywords, described as follows:
The first class is explicit keywords. A keyword of this class essentially appears only in sensitive text; statistically, its probability of appearing in sensitive text is very high (close to 1), and its probability of appearing in normal text is very low (close to 0). Semantically, these words carry sensitive information in themselves.
The second class is implicit keywords. A keyword of this class does not itself carry any sensitive information. For some reason, however, such words have acquired a fixed association with sensitive text; that is, they also appear in sensitive text with high probability, although they can of course appear in other text as well.
The third class is logic keywords, which fall into two kinds. One kind is polysemous words, whose meaning is normal in normal text but which carry sensitive information in sensitive text. The other kind is words that carry sensitive information jointly with certain other words after being collocated with them. This collocation can in turn be divided into two forms: explicit-plus-logic and logic-plus-logic.
Based on the above definitions, a keyword set is chosen, and semantic rules are constructed to describe the semantic association between words, which helps to extract feature information correctly. The extracted features, after normalization, serve as the feature vector of the continuous text. In step S4, a support vector machine (Support Vector Machine, SVM) is used as the classifier; it is trained on the features and classifies them, and whether the web page is sensitive is decided according to the SVM output.
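The interplay of the three keyword classes can be sketched as a small feature extractor. Everything concrete here is an assumption made for illustration: the placeholder keyword sets, the rule that a logic keyword counts only when it shares a sentence with an explicit keyword or with a different logic keyword, and normalization by the total token count. The actual keyword set and semantic rules are not given in the description.

```python
from typing import List, Sequence, Set

# Hypothetical placeholder keyword sets, one per class.
EXPLICIT: Set[str] = {"explicit_a", "explicit_b"}
IMPLICIT: Set[str] = {"implicit_a"}
LOGIC: Set[str] = {"logic_a", "logic_b"}


def sentence_features(tokens: Sequence[str]) -> List[int]:
    """Count [explicit, implicit, logic] keyword hits in one tokenized sentence."""
    explicit = sum(1 for t in tokens if t in EXPLICIT)
    implicit = sum(1 for t in tokens if t in IMPLICIT)
    logic_hits = [t for t in tokens if t in LOGIC]
    # Assumed semantic rule: logic keywords count only in an explicit-plus-logic
    # or logic-plus-logic collocation within the same sentence.
    if explicit > 0 or len(set(logic_hits)) >= 2:
        logic = len(logic_hits)
    else:
        logic = 0
    return [explicit, implicit, logic]


def text_features(sentences: Sequence[Sequence[str]]) -> List[float]:
    """Sum the per-sentence counts and normalize by the total number of tokens."""
    totals = [0, 0, 0]
    for tokens in sentences:
        for index, value in enumerate(sentence_features(tokens)):
            totals[index] += value
    token_count = sum(len(tokens) for tokens in sentences)
    if token_count == 0:
        return [0.0, 0.0, 0.0]
    return [value / token_count for value in totals]
```

A polysemous logic keyword that appears alone therefore contributes nothing, which is how such a rule would keep normal pages on related topics from being flagged.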
Specifically, for web pages of the second class, according to step S5, a keyword list is constructed manually; the keywords are counted in the text of the web page, the counts are normalized and input as a feature vector into the trained Bayes network, and whether the web page is sensitive is decided according to the output of the network.
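Counting keywords and normalizing the counts can be sketched as follows. The function name `keyword_vector` is hypothetical, the text is assumed to be tokenized already, and dividing by the total number of keyword hits is only one possible normalization, since the description does not specify which is used.

```python
from collections import Counter
from typing import List, Sequence


def keyword_vector(tokens: Sequence[str], keywords: Sequence[str]) -> List[float]:
    """Return normalized keyword frequencies, in the order of the keyword list."""
    keyword_set = set(keywords)
    counts = Counter(token for token in tokens if token in keyword_set)
    raw = [counts[keyword] for keyword in keywords]
    total = sum(raw)
    if total == 0:
        # No keyword occurs: return the zero vector rather than dividing by zero.
        return [0.0] * len(keywords)
    return [count / total for count in raw]
```

The resulting vector has one component per entry of the manually constructed keyword list and is what would be passed to the trained network.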
Specifically, for web pages of the third class, in step S3 the images in the web page that meet the size requirements are obtained. In step S6, the image classifier recognizes the images one by one, and the recognition result is (N1, N2), where N1 is the number of images recognized as sensitive and N2 is the number of images recognized as normal. At the same time, according to step S5, the Bayes classifier for discrete text is applied to the text in the web page, and its output Ps serves as prior knowledge of whether the images are sensitive. According to step S7, the image classifier is described by two parameters: P1 denotes the probability that a normal image is misclassified as a sensitive image, and P2 denotes the probability that a sensitive image is misclassified as a normal image. These parameters are substituted into the following formula for fusion:
The output value of each classifier is substituted into the above formula, and the computed result is compared with a threshold to judge whether the web page is sensitive.
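As an illustration only, and not as the fusion formula of the invention, the sketch below shows one way in which Ps, P1, P2, N1 and N2 could be combined into a single value that is compared with 1. It assumes that all images on a sensitive page are sensitive, that all images on a normal page are normal, and that the image classifier errs independently on each image; under these assumptions the posterior odds of "sensitive" against "normal" follow from Bayes' rule. The function names are hypothetical.

```python
import math


def fused_odds(p_s: float, n1: int, n2: int, p1: float, p2: float) -> float:
    """Posterior odds (sensitive : normal) under the stated independence assumptions.

    p_s: prior probability of a sensitive page, from the discrete text classifier
    n1, n2: numbers of images recognized as sensitive and as normal
    p1: probability that a normal image is misclassified as sensitive
    p2: probability that a sensitive image is misclassified as normal
    All probabilities must lie strictly between 0 and 1.
    """
    log_odds = math.log(p_s) - math.log(1.0 - p_s)
    # Each image flagged sensitive multiplies the odds by (1 - p2) / p1.
    log_odds += n1 * (math.log(1.0 - p2) - math.log(p1))
    # Each image flagged normal multiplies the odds by p2 / (1 - p1).
    log_odds += n2 * (math.log(p2) - math.log(1.0 - p1))
    return math.exp(log_odds)


def page_is_sensitive(p_s: float, n1: int, n2: int, p1: float, p2: float) -> bool:
    """Judge the page sensitive when the fused odds exceed 1."""
    return fused_odds(p_s, n1, n2, p1, p2) > 1.0
```

Unlike an AND or OR combination, a fusion of this kind lets many weakly suspicious images outweigh a low text prior, and a strong text prior outweigh a few images recognized as normal.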
In the embodiments above, each step is an example; those of ordinary skill in the art may determine the steps actually used according to the actual situation, and each step can be implemented in several ways, all of which fall within the scope of the present invention.
Finally, it should be noted that the above description serves to implement the present invention and its embodiments, and the scope of the present invention should not be limited by this description. Those skilled in the art should understand that any modification or partial replacement that does not depart from the scope of the present invention falls within the scope defined by the claims of the present invention.