CN101055621A

CN101055621A - Content based sensitive web page identification method

Info

Publication number: CN101055621A
Application number: CN 200610073172
Authority: CN
Inventors: 胡卫明; 吴偶; 陈周耀; 朱明亮
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2006-04-10
Filing date: 2006-04-10
Publication date: 2007-10-17
Anticipated expiration: 2026-04-10
Also published as: CN100412888C

Abstract

The invention discloses a content-based sensitive web page identification method, comprising the steps of: given the uniform resource locator of a web page, obtaining the source code of the web page, splitting and preprocessing the data, and obtaining text information and effective image information; processing the text information with a continuous sensitive text classifier and, if the classifier output is sensitive, ending the processing; otherwise processing the text information with a discrete sensitive text classifier and, if the classifier output is greater than a predetermined threshold, judging the page sensitive and ending the processing; otherwise recognizing the images with an image classifier and fusing the recognition result with the output of the discrete classifier. The invention addresses the problems of the prior art by combining a continuous sensitive text recognizer, a discrete text recognizer and a sensitive image recognizer. It uses web structure information and formulates an image set recognition problem to perform information fusion, improving the recognition rate for sensitive web pages.

Description

Content-based sensitive web page identification method

Technical field

The present invention relates to the technical field of information filtering, and in particular to a method for identifying web pages that contain sensitive information.

Background technology

Sensitive information on the Internet causes great harm to Internet users, especially teenagers, and has therefore attracted wide attention from researchers and industry.

Many methods for filtering sensitive information exist at present, including blacklists and whitelists, IP filtering, and keyword matching. On the one hand, these techniques work in a purely mechanical way: they can reach 100% filtering efficiency on some sensitive web pages with a very short response time, but their filter parameters can only be updated after actual sensitive web pages appear, so they cannot keep up with the rapid change of real sensitive websites. On the other hand, because they make little or no use of the content of the web page, they have a very high false filtering rate and interfere with users' normal browsing.

Content-based intelligent recognition of sensitive information has been a direction of development for filtering technology in recent years, and several content-based sensitive information recognition methods already exist.

Current sensitive web page identification methods are generally built mainly on sensitive text recognition, so their core is text processing: the text in the web page is extracted first, features are then extracted, and a classification algorithm from machine learning is used to train on and classify the features. Feature extraction usually proceeds as follows: (1) a keyword list is given manually; (2) text matching is used to count the number of occurrences of each keyword; (3) the occurrence counts form a vector which, after processing such as normalization, serves as the feature vector of the text. The number of given keywords is generally less than 100. A classifier is then selected for training and prediction. Pui Y. Lee et al. of Singapore used a Kohonen self-organizing neural network as the classifier and obtained fairly good practical results. There are also sensitive image recognition methods; for example, our institute proposed a content-based sensitive image recognition method that achieved a recognition rate above 80% on the CAMPAQ database.
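The three feature extraction steps described above can be illustrated with a minimal sketch. The keyword list and the sample text are hypothetical placeholders, and the choice of normalization (dividing by the total count) is an assumption, since the text only says "processing such as normalization".

```python
def keyword_feature_vector(text, keywords):
    """Count each keyword by plain substring matching, then normalize.

    The normalization used here (counts divided by their sum) is an
    assumption; the prior art only mentions "normalization" in general.
    """
    counts = [text.count(kw) for kw in keywords]
    total = sum(counts)
    if total == 0:
        # No keyword occurs: return an all-zero feature vector.
        return [0.0] * len(keywords)
    return [c / total for c in counts]


# Hypothetical keyword list and text, for illustration only.
KEYWORDS = ["alpha", "beta", "gamma"]
vec = keyword_feature_vector("alpha beta alpha delta", KEYWORDS)
```

The resulting vector has one component per keyword and can be passed to whatever classifier is chosen for training and prediction.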

Like the mechanical filtering methods, the above methods do not make good use of web features and cannot yet achieve satisfactory results. For example, text-based sensitive web page recognition cannot reliably distinguish normal web pages that are related to sensitive topics, and image-based sensitive web page recognition has a very high false recognition rate. Existing fusion algorithms merely combine results through AND/OR operations and cannot fundamentally improve the recognition rate.

To overcome the deficiencies of the prior art, the object of the present invention is to perform sensitive information recognition that takes the characteristics of web pages into account, and thereby further improve the recognition rate for sensitive web pages. To this end, the invention proposes a content-based sensitive web page identification method.

To achieve the above object, the content-based sensitive web page identification method of the present invention comprises a preprocessing step and a sensitive information identification step.

The preprocessing step comprises:

Given the uniform resource locator of a web page, obtaining the source code of the web page, and performing data splitting and preprocessing to obtain the text information;

Obtaining the structural information of the image part of the web page, and selecting the important images to form an effective image set;

The sensitive information identification step comprises:

Recognizing the text information with a continuous sensitive text recognizer;

Recognizing the text information with a discrete text recognizer;

Recognizing the images in the image set with a sensitive image recognizer.

The sensitive information identification step proceeds as follows:

The text information is recognized with the continuous sensitive text recognizer. If the recognition result is sensitive, processing ends; if the recognition result is not sensitive, the following is carried out:

The text information is recognized with the discrete text recognizer. If the recognizer output is greater than a threshold, the recognition result is sensitive and processing ends; if the recognition result is not sensitive, the following is carried out:

The images in the image set are recognized with the sensitive image recognizer, the recognition result is fused with the result of the discrete sensitive text recognizer, and whether the web page is sensitive is judged from the fusion result.

In the prior art, text-based sensitive web page recognition cannot reliably distinguish normal web pages related to sensitive topics, and image-based sensitive web page recognition relies on fusion by AND/OR operations, which cannot fundamentally improve the recognition rate. The present invention addresses these problems by combining a continuous sensitive text recognizer, a discrete text recognizer and a sensitive image recognizer. It uses web structure information and formulates an image set recognition problem to perform information fusion, improving the recognition rate for sensitive web pages.

Description of drawings

The above and other aspects, features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawing, in which:

Fig. 1 is a schematic diagram of the system framework of the present invention.

Embodiment

The present invention is described in detail below with reference to the accompanying drawing. It should be noted that the described embodiments are for illustrative purposes only and do not limit the present invention.

Fig. 1 is a schematic diagram of the system framework of the present invention. The specific steps are as follows:

Step S1: obtain the source code of the web page at the given URL;

Step S2: separate out the Chinese text in the source code;

Step S3: obtain the size information of the images in the source code, and discard some of the images according to rules;

Step S4: recognize the separated Chinese text with the continuous text classifier; if the recognition result is 1, the web page is sensitive and processing exits;

Step S5: recognize the Chinese text with the discrete text classifier; if the recognition result is greater than a set threshold, the web page is sensitive and processing exits;

Step S6: recognize the images with the image classifier;

Step S7: fuse the image recognition result with the result of the discrete text recognition.
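The cascade formed by steps S4 to S7 can be sketched as a single function. This is only an outline under stated assumptions: the function name `identify_page`, its parameters, and the stub classifiers in the usage example are all hypothetical, and steps S1 to S3 (fetching the page and preprocessing) are assumed to have already produced `text` and `images`.

```python
def identify_page(text, images, continuous_clf, discrete_clf, image_clf, fuse, tau):
    """Return True if the page is judged sensitive (steps S4 to S7).

    All callables are placeholders supplied by the caller:
      continuous_clf(text) -> 1 if sensitive, anything else otherwise
      discrete_clf(text)   -> probability that the text is sensitive
      image_clf(image)     -> 1 if the image is sensitive
      fuse(n1, n2, p_s)    -> fusion score compared against 1
    """
    # Step S4: continuous text classifier; exit early on a sensitive result.
    if continuous_clf(text) == 1:
        return True

    # Step S5: discrete text classifier; exit early above the threshold.
    p_s = discrete_clf(text)
    if p_s > tau:
        return True

    # Step S6: classify every image in the effective image set.
    labels = [image_clf(img) for img in images]
    n1 = sum(1 for label in labels if label == 1)  # images judged sensitive
    n2 = len(labels) - n1                          # images judged normal

    # Step S7: fuse the image result with the discrete text result.
    return fuse(n1, n2, p_s) > 1


# Usage with stub classifiers (illustrative values only).
result = identify_page(
    text="example text",
    images=["img1", "img2"],
    continuous_clf=lambda t: 0,
    discrete_clf=lambda t: 0.2,
    image_clf=lambda img: 1,
    fuse=lambda n1, n2, p_s: 2.0 if n1 > n2 else 0.5,
    tau=0.5,
)
```

Passing the classifiers in as callables keeps the early-exit order of the cascade visible without committing to any particular model.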

According to step S3, the step of selecting important images comprises:

Obtaining the size information of every image contained in the web page;

If the size of an image satisfies rules obtained in advance from statistics, the image is regarded as an important image and is placed in the effective image set.
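A minimal sketch of this selection step follows, assuming that image sizes are read from the `width` and `height` attributes of `<img>` tags. The size rule itself (a minimum side length and a maximum aspect ratio, with the default values shown) is a hypothetical stand-in, because the text does not state the statistically derived rules.

```python
from html.parser import HTMLParser


class ImageSizeParser(HTMLParser):
    """Collect (src, width, height) for <img> tags with integer sizes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        try:
            width = int(a.get("width", ""))
            height = int(a.get("height", ""))
        except (TypeError, ValueError):
            return  # size missing or not a plain integer: skip the image
        self.images.append((a.get("src", ""), width, height))


def select_effective_images(html, min_side=100, max_aspect=3.0):
    """Keep images passing a hypothetical size rule (thresholds are assumed)."""
    parser = ImageSizeParser()
    parser.feed(html)
    effective = []
    for src, w, h in parser.images:
        short, long_ = min(w, h), max(w, h)
        if short > 0 and short >= min_side and long_ / short <= max_aspect:
            effective.append(src)
    return effective


sample_html = (
    '<img src="a.jpg" width="400" height="300">'
    '<img src="icon.gif" width="16" height="16">'
    '<img src="banner.png" width="728" height="90">'
)
effective = select_effective_images(sample_html)
```

With these assumed thresholds, small icons and wide banners are discarded and only the photo-sized image remains in the effective image set.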

According to step S4, the step of recognizing text with the continuous sensitive text recognizer comprises:

Extracting the features of the text;

Inputting the text features into a support vector machine (SVM) trained in advance; if the output is 1, the text is sensitive and processing ends; otherwise processing continues.
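How a trained SVM produces the output 1 can be illustrated with a linear decision function. This is a sketch under assumptions: the kernel type is not stated in the text, so a linear kernel is assumed, the label -1 for the non-sensitive case is an assumption, and the weights and bias below are made-up values rather than a trained model.

```python
def linear_svm_predict(features, weights, bias):
    """Decision rule of a linear SVM: sign of w . x + b.

    Returns 1 (sensitive) when the score is positive, otherwise -1.
    The linear kernel and the -1 label are assumptions for illustration.
    """
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else -1


# Hypothetical parameters standing in for a model trained in advance.
WEIGHTS = [2.0, -1.0]
BIAS = -0.5

sensitive_label = linear_svm_predict([0.8, 0.2], WEIGHTS, BIAS)
normal_label = linear_svm_predict([0.1, 0.9], WEIGHTS, BIAS)
```

In the cascade, a return value of 1 ends processing, and any other value passes the text on to the discrete text recognizer.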

According to step S5, the step of recognizing text with the discrete sensitive text recognizer comprises:

Extracting the features of the text with the vector space model (VSM);

Inputting the text features into a trained Bayesian network (Bayes Networks, abbreviated BNS), whose output is the probability that the input text is sensitive; if the probability is greater than the threshold τ, the text is sensitive and processing ends; otherwise processing continues.
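The probability output and the threshold test can be sketched with a multinomial naive Bayes classifier, which is the simplest Bayesian network structure and is swapped in here only for illustration; the text does not specify the network structure. The training vectors, the Laplace smoothing, and the value of τ are assumptions, and both classes are assumed to be present in the training data.

```python
import math


def train_naive_bayes(vectors, labels):
    """Fit multinomial naive Bayes on term-count vectors (1 = sensitive)."""
    n_terms = len(vectors[0])
    model = {}
    for c in (0, 1):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_terms  # Laplace (add-one) smoothing
        model[c] = {
            "log_prior": math.log(len(rows) / len(vectors)),
            "log_like": [math.log((t + 1) / denom) for t in totals],
        }
    return model


def sensitive_probability(model, vector):
    """Posterior probability that the term-count vector is sensitive."""
    log_post = {
        c: m["log_prior"] + sum(n * ll for n, ll in zip(vector, m["log_like"]))
        for c, m in model.items()
    }
    peak = max(log_post.values())  # subtract the peak before exponentiating
    weight = {c: math.exp(v - peak) for c, v in log_post.items()}
    return weight[1] / (weight[0] + weight[1])


# Toy term-count vectors over a two-term vocabulary (illustrative only).
train_vectors = [[3, 0], [2, 1], [0, 3], [1, 2]]
train_labels = [1, 1, 0, 0]
model = train_naive_bayes(train_vectors, train_labels)

TAU = 0.5  # hypothetical threshold
p = sensitive_probability(model, [2, 0])
is_sensitive = p > TAU
```

When the probability does not exceed τ, the value is kept and later serves as the text-side input to the fusion step.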

According to step S6, the image recognition step comprises:

Recognizing every image with the image recognizer; the number of images recognized as sensitive is N₁, and the number of images recognized as normal is N₂.

According to step S7, the information fusion step comprises:

Fusing the result of the discrete text recognition with the image recognition result of step S6 by substituting the recognition results into formula (1-1); if the value is greater than 1, the web page is sensitive, otherwise it is normal, and processing ends.

In steps S1 and S2 of the method, web pages are divided into three classes based on an analysis of the web. The first class is web pages based on continuous text, where continuous text is defined as article-like text; it is characterized by strong semantic association between contexts, which provides rich semantic information to exploit. Pages of this type usually contain one or several articles. The second class is web pages based on discrete text, where discrete text refers to any text other than continuous text, for example the text on a home page or the explanatory text around pictures, which mainly serves as links or captions. The third class is web pages based on images: what the page mainly presents is image information, together with a small amount of discrete text.
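As an illustration of step S2, the separation of Chinese text from the page source might be sketched as follows. The approach is an assumption: scripts, styles and tags are stripped with regular expressions, and Chinese text is taken to be runs of characters in the CJK Unified Ideographs range U+4E00 to U+9FFF. The function name is hypothetical.

```python
import re


def extract_chinese_text(html):
    """Return the runs of Chinese characters found outside tags."""
    # Drop <script> and <style> elements together with their content.
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    # Drop the remaining tags, keeping only the text between them.
    html = re.sub(r"(?s)<[^>]+>", "", html)
    # Keep runs of CJK Unified Ideographs.
    return re.findall(r"[\u4e00-\u9fff]+", html)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    '<body><p>欢迎访问 our site</p><a href="x">新闻</a></body></html>'
)
segments = extract_chinese_text(sample)
```

The length of the extracted runs is one possible cue for telling article-like continuous text from short discrete text such as link labels, although the text does not state how that distinction is made.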

Specifically, for web pages of the first class, in which continuous text dominates, the invention uses a filtering method that combines semantics and statistics. Three classes of keywords are defined, with the following descriptive definitions:

The first class is explicit keywords. Keywords of this class essentially appear only in sensitive text; statistically, the probability that they appear in sensitive text is very high (close to 1) and the probability that they appear in normal text is very low (close to 0). Semantically, these words carry sensitive information in themselves.

The second class is implicit keywords. Keywords of this class do not themselves carry any sensitive information, but for some reason they have acquired a fixed association with sensitive text; that is, they also appear in sensitive text with high probability, although they can of course appear in other text as well.

The third class is logic keywords, which fall into two kinds. One kind is polysemous words, whose meaning is normal in normal text but which carry sensitive information in sensitive text. The other kind consists of words that carry sensitive information jointly after being collocated with some other word. Such collocations can be divided into two types: explicit-plus-logic and logic-plus-logic. Based on the above definitions, a keyword set is chosen, and semantic rules are built to describe the semantic associations between words, which helps extract the feature information correctly. The extracted features are normalized and used as the feature vector of the continuous text. In step S4, a support vector machine (SVM) is used as the classifier to train on and classify the features, and whether the web page is sensitive is decided from the SVM output.
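The three keyword classes can be combined into one feature vector along the following lines. This is a sketch under assumptions: the semantic rules are not given in the text, so a single hypothetical rule is used, namely that a logic keyword is counted only when its collocated partner word also occurs in the text. All keyword lists are placeholders.

```python
def continuous_text_features(text, explicit, implicit, collocations):
    """Build a normalized feature vector from three keyword classes.

    explicit, implicit: lists of keywords counted directly.
    collocations: (logic_keyword, partner) pairs; the logic keyword is
    counted only if the partner also occurs (an assumed semantic rule).
    """
    feats = [text.count(k) for k in explicit]
    feats += [text.count(k) for k in implicit]
    feats += [text.count(word) if partner in text else 0
              for word, partner in collocations]
    total = sum(feats)
    if total == 0:
        return [0.0] * len(feats)
    return [f / total for f in feats]


# Placeholder vocabulary, for illustration only.
features = continuous_text_features(
    "aa bb cc aa",
    explicit=["aa"],
    implicit=["bb"],
    collocations=[("cc", "dd"), ("cc", "aa")],
)
```

Here the logic keyword `cc` contributes nothing when paired with the absent partner `dd`, but is counted when paired with `aa`, which does occur in the text.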

Specifically, for web pages of the second class, according to step S5, a keyword list is constructed manually. The keywords are counted in the text of the web page, the counts are normalized and input as a feature vector into the trained Bayes network, and whether the web page is sensitive is decided from the output of the network.

Specifically, for web pages of the third class, step S3 selects from the web page the images whose size satisfies the requirements. In step S6, the image classifier recognizes the images one by one, and the recognition result is (N₁, N₂), where N₁ is the number of images recognized as sensitive and N₂ is the number of images recognized as normal. At the same time, according to step S5, the Bayes classifier for discrete text is applied to the text in the web page; its output P_s serves as the prior probability that the images are sensitive. According to step S7, the image classifier is described by two parameters: p₁ is the probability that a normal image is misclassified as sensitive, and p₂ is the probability that a sensitive image is misclassified as normal. These parameters are substituted into the following formula for fusion:

$$\frac{(1 - p_{2})^{N_{1}} \, p_{2}^{N_{2}}}{p_{1}^{N_{1}} \, (1 - p_{1})^{N_{2}}} \cdot \frac{P_{s}}{1 - P_{s}} \qquad (1\text{-}1)$$

The output values of the classifiers are substituted into the above formula, and the computed result is compared with a threshold to judge whether the web page is sensitive.
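Formula (1-1) can be read as a likelihood ratio for the observed image counts multiplied by the prior odds obtained from the discrete text classifier. A minimal sketch of the computation follows; the function name and the numeric values are illustrative assumptions, and all probabilities are assumed to lie strictly between 0 and 1.

```python
def fusion_score(n1, n2, p1, p2, p_s):
    """Evaluate formula (1-1).

    n1, n2: numbers of images recognized as sensitive / normal
    p1: probability a normal image is misclassified as sensitive
    p2: probability a sensitive image is misclassified as normal
    p_s: output of the discrete text classifier (prior probability)
    """
    likelihood_ratio = ((1 - p2) ** n1 * p2 ** n2) / (p1 ** n1 * (1 - p1) ** n2)
    prior_odds = p_s / (1 - p_s)
    return likelihood_ratio * prior_odds


# Illustrative values: three images judged sensitive, one judged normal.
score = fusion_score(n1=3, n2=1, p1=0.2, p2=0.1, p_s=0.3)
page_is_sensitive = score > 1
```

With these example values the image evidence outweighs the low text prior, so the score exceeds 1. For pages with many images, summing logarithms instead of multiplying powers would avoid numerical underflow.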

In the above description, each step is given as an example. Those of ordinary skill in the art can determine the steps actually used according to the actual situation, and each step can be implemented in several ways, all of which fall within the scope of the present invention.

Finally, the above description serves to implement the present invention and its embodiments, and the scope of the present invention should not be limited by this description. Those skilled in the art should understand that any modification or partial replacement that does not depart from the scope of the present invention falls within the scope defined by the claims of the present invention.

Claims

1. A content-based sensitive web page identification method, comprising the steps of:

A preprocessing step, comprising:

Obtaining the structural information of the image part of the web page, and selecting the important images to form an effective image set;

A web page sensitive information identification step, comprising:

2. The content-based sensitive web page identification method according to claim 1, characterized in that the sensitive information identification step is as follows:

3. The content-based sensitive web page identification method according to claim 1, characterized in that the step of selecting important images comprises:

4. The content-based sensitive web page identification method according to claim 1, characterized in that the step of recognizing text with the continuous sensitive text recognizer comprises:

Extracting the features of the text;

Inputting the text features into a support vector machine trained in advance; if the output is 1, the text is sensitive and processing ends; otherwise processing continues.

5. The content-based sensitive web page identification method according to claim 1, characterized in that the step of recognizing text with the discrete sensitive text recognizer comprises:

Extracting the features of the text with the vector space model;

Inputting the text features into a trained Bayesian network, whose output is the probability that the input text is sensitive; if the probability is greater than the threshold τ, the text is sensitive and processing ends; otherwise processing continues.

6. The content-based sensitive web page identification method according to claim 1, characterized in that the image recognition and information fusion step comprises:

Fusing the result of the discrete text recognition with the result of the above image recognition; if the result is greater than 1, the web page is sensitive, otherwise it is normal, and processing ends.