CN103617156A - Multi-protocol network file content inspection method

Info

Publication number: CN103617156A
Application number: CN201310567527.8A
Authority: CN
Inventors: 刘功申; 丁宵云; 苏波; 孟魁; 宁蔚
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2013-11-14
Filing date: 2013-11-14
Publication date: 2014-03-05

Abstract

The invention provides a multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features. The method comprises: identifying the network protocol of a data packet and performing packet reassembly, decoding, text extraction and text restoration; then segmenting the restored text into words and extracting feature vectors with a feature reduction algorithm. The feature reduction algorithm includes a document frequency method, an information gain method and a.

Description

Multiprotocol network file content inspection method

Technical field

The present invention relates to a method in the field of network information technology, specifically to a multi-protocol network file content inspection method, and more specifically to a method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features.

Background technology

The Internet has developed rapidly over recent decades, and the network has become an important part of informatization. As a result, however, information of widely varying quality floods the Internet. Traditional methods for detecting sensitive information in network traffic can only inspect packets that are unencoded and arrive in order, and they all rely on string matching to detect this information. As network services are updated day by day, traditional text-based detection of sensitive information can no longer meet current demands. The shortcomings of the traditional detection methods are mainly the following:

1. Encoded or out-of-order packets cannot be processed

To reduce the size of the transmitted data, or to guarantee the accuracy of transmission, many network protocols transmit packets using an encoding agreed on by the protocol. Traditional detection does not understand the protocol format used by the two communicating parties and therefore cannot decode the data correctly. Moreover, packets that arrive out of order or more than once because different network paths were chosen cannot be reassembled to recover the original information.

2. Full-text matching wastes resources

Conventional techniques must match the full text entering the system before they can conclude whether it contains harmful information. Although researchers have proposed algorithms such as KMP and Boyer-Moore to make the search easier and to reduce the time complexity of processing, the worst-case complexity is still O(m*n).

3. Harmful features must be defined in advance

To detect harmful text, conventional techniques must define in advance the sensitive information to be filtered, which requires a huge database of harmful information as a basis. When new harmful information appears, updates to the database often lag behind, so the detection system does not respond well in real time.

4. Detection of harmful information is not robust

To evade detection systems, text is often constructed so that it differs slightly from the entries in the harmful-information database while remaining a pattern that people can recognize, for example by separating sensitive words with spaces or by using wrongly written characters. This makes it difficult to build the harmful-information database.

Researchers have used classification to address this kind of large-scale data mining and have proposed the one-class support vector machine model, but the model has shortcomings in practical use. The most prominent is dimension explosion, which arises because the vocabulary contained in text is very large. The Modern Chinese Dictionary, 5th edition (published in May 2005 by the Commercial Press), for example, contains 65,000 entries, and using such a high-dimensional space is a serious waste of storage resources and computing power.

Summary of the invention

The technical problem to be solved by the present invention is to address the above defects of the prior art by providing a new method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features, a method that resolves the problems faced by traditional detection methods.

To achieve this technical purpose, the present invention provides a multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features. The method comprises: first, identifying the network protocol of a data packet and performing packet reassembly, decoding, text extraction and text restoration; then, segmenting the restored text into words, extracting feature vectors with a feature reduction algorithm, and classifying the text.

Preferably, the feature vectors are nouns and verbs.

Preferably, the feature reduction algorithms comprise a document frequency method, an information gain method and a chi-square goodness-of-fit test.

Preferably, the document frequency method uses the number of documents of a class in which a feature word appears to represent the relevance of that feature word to the class; a feature word that appears in more documents of a class is more likely to be retained.

Preferably, the information gain method calculates the difference between the amount of information in the system with and without a feature; this difference defines the amount of information the feature brings to the system and serves as the basis for its ability to detect a given class.

Preferably, the chi-square goodness-of-fit test observes the deviation between the actual values and the theoretical values to determine whether the hypothesis that the feature has a significant influence on the system is correct.

The present invention also provides a multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features, comprising:

First step: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors;

Second step: select from the candidate feature vectors using the feature reduction algorithms;

Third step: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all candidates in the second step, thereby obtaining the classification criterion;

Fourth step: determine the transport protocol of a data packet, and extract and restore the text information according to the RFC definitions of the various transport-layer and application-layer protocols;

Fifth step: segment the text information restored in the fourth step into words and extract its feature vector; then classify it with the SVM according to the training result of the third step to detect whether it is harmful text.

Preferably, all nouns and verbs are extracted as candidate feature vectors.

Preferably, in the second step the feature reduction algorithms select from the candidate feature vectors, extracting only the feature vectors that have a larger influence on the system. The feature reduction algorithms comprise the document frequency method, the information gain method and the chi-square goodness-of-fit test, specifically:

(1) Document frequency method

The algorithm counts the frequency of occurrence of the non-stop words across the whole database, sorts them by frequency, and selects the most frequent ones as feature words for the dimension mapping of the SVM algorithm; the number selected depends on the precision the system requires. First, each article in the database is segmented into words, and only nouns and verbs are kept as candidate feature words. Then every candidate word that is not in the stop-word list is counted and recorded in a frequency table. Finally, the candidate words in the frequency table are sorted by count, and the top n are selected as the feature words produced by the DF algorithm, which ends the algorithm;

(2) Information gain method

The information entropy and the conditional entropy are calculated for each candidate word obtained from preprocessing. Once the entropy introduced by each candidate word has been calculated, the words are sorted by this value in descending order, and the top n are selected as the feature words produced by the IG algorithm, which ends the algorithm;

The information entropy can be calculated with the following formula

H(c) = -\sum_{i=1}^{n} p(c_i) \log p(c_i)

where c_i is the i-th class and p(c_i) is the frequency with which c_i occurs;

The conditional entropy can be calculated with the following formulas

H(c \mid x_j) = -\sum_{i=1}^{n} p(c_i \mid x_j) \log p(c_i \mid x_j)

H(c \mid \bar{x}_j) = -\sum_{i=1}^{n} p(c_i \mid \bar{x}_j) \log p(c_i \mid \bar{x}_j)

where p(c_i \mid x_j) is the probability that class c_i occurs when feature x_j appears, and p(c_i \mid \bar{x}_j) is the probability that class c_i occurs when feature x_j does not appear;

The entropy introduced by a candidate word is the difference between the information entropy of the system and the conditional entropy, and can be calculated with the following formula

IG(x_i) = H(c) - p(x_i) H(c \mid x_i) - p(\bar{x}_i) H(c \mid \bar{x}_i)

where p(x_i) is the frequency with which x_i appears, p(\bar{x}_i) is the frequency with which x_i does not appear, H(c) is the information entropy of the two classes, p(c_i \mid x_i) is the probability that class c_i occurs when feature x_i appears, and p(c_i \mid \bar{x}_i) is the probability that class c_i occurs when feature x_i does not appear;

(3) Chi-square goodness-of-fit test

For each candidate word, count the number of records A in the normal class in which it appears, the number of records B in the normal class in which it does not appear, the number of records C in the abnormal class in which it appears, and the number of records D in the abnormal class in which it does not appear; then calculate the weight with the following formula,

\chi^2 = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}, \quad N = A + B + C + D

The words are sorted by this value in descending order, and the top n are selected as the feature words produced by the chi-square goodness-of-fit test, which ends the algorithm.

Preferably, in the fourth step, the protocols used at the network layer and the application layer of the text information are determined from the values of the offset fields in the data packet; the information in the network layer is used to confirm the order of the packets so that the application-layer information can be restored in its original transmission order, and the information in the application layer defines the specific encoding.

Brief description of the drawings

A more complete understanding of the invention, and of its attendant advantages and features, will be gained more easily by reference to the following detailed description taken together with the drawings, in which:

Fig. 1 schematically shows a flowchart of the multi-protocol network file content inspection method according to an embodiment of the invention.

Fig. 2 schematically shows an example of the data packets used according to an embodiment of the invention.

Fig. 3 schematically shows a decoding example according to an embodiment of the invention.

It should be noted that the drawings are intended to illustrate the invention, not to limit it. Drawings that show structure may not be drawn to scale, and identical or similar elements are denoted by identical or similar reference numerals.

Embodiment

To make the content of the invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the drawings.

The invention provides a multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features. First, the network protocol of a data packet is identified, and packet reassembly, decoding, text extraction and text restoration are performed; then the restored text is segmented into words, feature vectors are extracted with a feature reduction algorithm, and the text is classified.

Specific embodiments of the invention are described below.

As shown in Fig. 1, the multi-protocol network file content inspection method according to an embodiment of the invention, which detects sensitive information in network traffic using a one-class feature vector machine with reduced features, comprises:

First step S1: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors.

Word segmentation here means splitting a passage according to its grammar and then tagging each resulting word with its part of speech, for example noun, verb or adverb.

The feature vectors here are nouns and verbs, which are later used as vectors in SVM (Support Vector Machine) training.

Second step S2: use three feature reduction algorithms to accept or reject candidate feature vectors, extracting only those that have a larger influence on the system.

The three feature reduction algorithms are the document frequency (Document Frequency, DF) method, the information gain (Information Gain, IG) method and the chi-square goodness-of-fit test (CHI statistic).

Feature vector extraction means that the texts in the existing database are segmented into words and all resulting nouns and verbs become candidate feature vectors. From all candidates, a subset is then selected by a given extraction method as the feature vectors used for the subsequent detection of harmful information.

The document frequency method uses the number of documents of a class in which a feature word appears to represent the relevance of that feature word to the class. A feature word that appears in more documents of a class is more likely to be retained. In operation, the algorithm counts the frequency of occurrence of the non-stop words across the whole database, sorts them by frequency, and selects the most frequent ones as feature words for the dimension mapping of the SVM algorithm. The number selected depends on the precision the system requires, for example 500 or 1000, and is set according to the computational load the system can bear. In practice, each article in the database is first segmented into words, and only nouns and verbs are kept as candidate feature words. Every candidate word that is not in the stop-word list is then counted and recorded in a frequency table. These steps are also the preprocessing for the IG and CHI algorithms and are not repeated in the descriptions below. Finally, the candidate words in the frequency table are sorted by count, and the top n are selected as the feature words produced by the DF algorithm, which ends the algorithm.
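The document frequency selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `select_by_document_frequency`, the toy corpus and the stop-word set are assumptions, and each word is counted once per document on the assumption that "frequency" here means the number of documents containing the word.

```python
from collections import Counter


def select_by_document_frequency(docs, stop_words, n):
    """Return the n candidate words that occur in the most documents.

    docs: list of token lists, already segmented and reduced to nouns and verbs.
    stop_words: set of words to ignore (hypothetical stop-word list).
    """
    frequency_table = Counter()
    for tokens in docs:
        # Count each candidate word at most once per document.
        frequency_table.update(set(tokens) - stop_words)
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(frequency_table.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


docs = [["contract", "sign", "price"], ["contract", "price"], ["contract", "meet"]]
top = select_by_document_frequency(docs, stop_words={"meet"}, n=2)
print(top)  # ['contract', 'price']
```

In a real system n would be on the order of the 500 or 1000 feature words mentioned in the text.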

The information gain method calculates the difference between the amount of information in the system with and without a feature; this difference defines the amount of information the feature brings to the system and serves as the basis for its ability to detect a given class. In operation, the information entropy and the conditional entropy are calculated for each candidate word obtained from preprocessing. It should be noted that, because the result is applied in a one-class SVM algorithm and the frequencies of occurrence of the classes differ greatly, the number of documents in each class must be taken into account when calculating p(c_i). Once the entropy introduced by each candidate word has been calculated, the words are sorted by this value in descending order, and the top n are selected as the feature words produced by the IG algorithm, which ends the algorithm.

The information entropy can be calculated with the following formula

H(c) = -\sum_{i=1}^{n} p(c_i) \log p(c_i)

where c_i is the i-th class and p(c_i) is the frequency with which c_i occurs.

The conditional entropy can be calculated with the following formulas

H(c \mid x_j) = -\sum_{i=1}^{n} p(c_i \mid x_j) \log p(c_i \mid x_j)

H(c \mid \bar{x}_j) = -\sum_{i=1}^{n} p(c_i \mid \bar{x}_j) \log p(c_i \mid \bar{x}_j)

where p(c_i \mid x_j) is the probability that class c_i occurs when feature x_j appears, and p(c_i \mid \bar{x}_j) is the probability that class c_i occurs when feature x_j does not appear.

The entropy introduced by a candidate word, that is, the difference between the information entropy of the system and the conditional entropy, can be calculated with the following formula

IG(x_i) = H(c) - p(x_i) H(c \mid x_i) - p(\bar{x}_i) H(c \mid \bar{x}_i)

where p(x_i) is the frequency with which x_i appears, p(\bar{x}_i) is the frequency with which x_i does not appear, H(c) is the information entropy of the two classes, p(c_i \mid x_i) is the probability that class c_i occurs when feature x_i appears, and p(c_i \mid \bar{x}_i) is the probability that class c_i occurs when feature x_i does not appear.
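As an illustration of the information gain calculation, the sketch below computes the system entropy minus the conditional entropy weighted by whether a word appears. The function names, the class labels and the base-2 logarithm are assumptions made for the example; the text does not fix the logarithm base, which only rescales the scores and does not change the ranking.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def information_gain(docs, labels, word):
    """Entropy of the class distribution minus the entropy after splitting on `word`.

    docs: list of token sets; labels: class label of each document (same order).
    """
    total = len(docs)
    classes = sorted(set(labels))
    h_c = entropy([labels.count(c) / total for c in classes])

    present = [lab for doc, lab in zip(docs, labels) if word in doc]
    absent = [lab for doc, lab in zip(docs, labels) if word not in doc]

    conditional = 0.0
    for subset in (present, absent):
        if subset:  # skip an empty branch, whose weight is zero
            weight = len(subset) / total
            conditional += weight * entropy([subset.count(c) / len(subset) for c in classes])
    return h_c - conditional


docs = [{"contract"}, {"contract"}, {"meet"}, {"meet"}]
labels = ["bad", "bad", "normal", "normal"]
print(information_gain(docs, labels, "contract"))  # 1.0
```

A word that separates the two classes perfectly, as here, gains the full one bit of class entropy; a word that appears in no document gains nothing.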

The chi-square goodness-of-fit test observes the deviation between the actual values and the theoretical values to determine whether the hypothesis that the feature has a significant influence on the system is correct. In operation, for each candidate word, count the number of records A in the normal class in which it appears, the number of records B in the normal class in which it does not appear, the number of records C in the abnormal class in which it appears, and the number of records D in the abnormal class in which it does not appear. The weight is then calculated with the following formula.

\chi^2 = \frac{N (AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}, \quad N = A + B + C + D

The words are sorted by this value in descending order, and the top n are selected as the feature words produced by the CHI algorithm, which ends the algorithm.
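The chi-square weight of a single candidate word can be sketched from the four counts A, B, C and D defined above, using the standard statistic for a 2x2 contingency table. The function name `chi_square` and the zero result for a degenerate table are assumptions made for the example.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table of document counts.

    a: normal documents containing the word
    b: normal documents not containing the word
    c: abnormal documents containing the word
    d: abnormal documents not containing the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # degenerate table: a row or column is empty (assumed convention)
    return n * (a * d - b * c) ** 2 / denominator


print(chi_square(10, 0, 0, 10))  # 20.0
print(chi_square(5, 5, 5, 5))    # 0.0
```

A word that occurs in every normal document and in no abnormal one gets the largest weight, while a word spread evenly across both classes gets zero.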

Third step S3: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all candidates in the second step S2, to obtain the classification criterion.
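Before training, each labeled document has to be mapped onto the selected feature words. The sketch below shows one plausible form of this dimension mapping, a plain term-count vector; the function name `to_feature_vector` and the choice of raw counts rather than some other weighting are assumptions, since the text does not specify the vector values. The SVM training itself is not shown.

```python
from collections import Counter


def to_feature_vector(tokens, feature_words):
    """Map a segmented document onto a fixed list of feature words.

    The i-th component is how often feature_words[i] occurs in the document;
    words outside the feature list are ignored.
    """
    counts = Counter(tokens)
    return [counts[word] for word in feature_words]


feature_words = ["contract", "price"]
vector = to_feature_vector(["contract", "sign", "contract"], feature_words)
print(vector)  # [2, 0]
```

Vectors of this kind, one per document, could then be passed to any one-class SVM implementation for training and later for classification.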

Fourth step S4: determine the transport protocol of a data packet, and extract and restore the text information according to the RFC (Request For Comments) definitions of the various transport-layer and application-layer protocols.

Following the RFC definitions, the system determines the protocols used at the network layer and the application layer of the text information from the values of the offset fields in the data packet. The information in the network layer can be used to confirm the order of the packets so that the application-layer information is restored in its original transmission order, while the information in the application layer defines the specific encoding. Once the application layer of a packet has been identified, its content is recovered according to the RFC definition.
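Restoring the original order from out-of-order and repeated packets can be sketched as below. The example assumes that each captured piece carries a byte offset into the original stream (for instance a relative sequence number or a fragment offset); the function name `reassemble` and the `(offset, payload)` representation are hypothetical, and sequence-number wraparound is ignored.

```python
def reassemble(fragments):
    """Rebuild a byte stream from (offset, payload) pairs in any arrival order."""
    stream = bytearray()
    next_offset = None
    for offset, payload in sorted(fragments, key=lambda f: f[0]):
        if next_offset is None:
            next_offset = offset  # the stream starts at the lowest offset seen
        end = offset + len(payload)
        if end <= next_offset:
            continue  # duplicate or fully retransmitted data
        if offset > next_offset:
            raise ValueError(f"missing data before offset {offset}")
        stream += payload[next_offset - offset:]  # drop any overlapping prefix
        next_offset = end
    return bytes(stream)


arrived = [(106, b"WORLD"), (100, b"HELLO "), (111, b"!"), (100, b"HELLO ")]
print(reassemble(arrived))  # b'HELLO WORLD!'
```

Raising an error on a gap is a simplification; a capture system would more likely buffer the later pieces until the missing one arrives.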

Fifth step S5: segment the text information restored in the fourth step S4 into words and extract its feature vector; then classify it with the SVM according to the training result of the third step S3 to detect whether it is harmful text.

Advantages of the invention: it realizes a new method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features. The method overcomes the shortcomings faced by traditional collection methods, namely the inability to process encoded or out-of-order packets, the waste of resources in full-text matching, and the need to define harmful keywords in advance, and at the same time makes detection more robust. The invention can provide a network information center (NIC) with recovery and storage of data transmitted over multiple protocols, and can provide a network security center with identification and filtering of harmful information.

An embodiment of the invention is described in detail below. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and concrete operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.

The collected party in this embodiment is a typical forum, and the embodiment comprises the following steps:

First step: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors.

For example, take the sentence "Happy birthday to the motherland; National Day is almost here; may the motherland thrive and prosper."

The result after word segmentation is "wish/v motherland/n birthday/n happy/an ,/un National-Day/t at-once/n arrive/v (particle)/v ,/un wish/v motherland/n thriving-and-prosperous/i".

Here v denotes a verb and n denotes a noun; the other words, such as adjectives and adverbs, are discarded and not considered.
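Keeping only the nouns and verbs from segmenter output in the `word/tag` form shown above can be sketched as follows. The function name `candidate_words` is hypothetical, and the input format (items separated by whitespace, tag after the last slash) is an assumption based on the example in the text; the segmenter itself is not shown.

```python
def candidate_words(tagged_text):
    """Return the words tagged as noun (n) or verb (v) from 'word/tag' items."""
    words = []
    for item in tagged_text.split():
        # Split on the last slash so that a word containing '/' is still handled.
        word, separator, tag = item.rpartition("/")
        if separator and word and tag in ("n", "v"):
            words.append(word)
    return words


tagged = "wish/v motherland/n birthday/n happy/an ,/un"
print(candidate_words(tagged))  # ['wish', 'motherland', 'birthday']
```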

Second step: use three feature reduction algorithms to accept or reject candidate feature vectors, extracting only those that have a larger influence on the system.

The document frequency method uses the number of documents of a class in which a feature word appears to represent the relevance of that feature word to the class.

The information gain method calculates the difference between the amount of information in the system with and without a feature; this difference defines the amount of information the feature brings to the system and serves as the basis for its ability to detect a given class.

The chi-square goodness-of-fit test observes the deviation between the actual values and the theoretical values to determine whether the hypothesis that the feature has a significant influence on the system is correct.

Third step: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all candidates in the second step, to obtain the classification criterion.

Training on the harmful-information database yields the top 20 extracted feature vectors shown in the following table.

Fourth step: determine the transport protocol of a data packet, and extract and restore the text information according to the RFC definitions of the various transport-layer and application-layer protocols.

In this embodiment, the system reassembles, extracts, decodes and restores the data packets shown in Fig. 2.

1. Protocol identification. The transport protocol of the packets is TCP. Parsing the ports used in the TCP exchange shows that port 25 is used, so the application-layer protocol is SMTP.
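Identifying the application-layer protocol from the TCP ports can be sketched as below. The first four bytes of a TCP header hold the source and destination ports as big-endian 16-bit integers; the port table and the function name `identify_application_protocol` are assumptions made for the example, and a real system would cover more protocols and handle non-standard ports.

```python
import struct

# Hypothetical lookup table of well-known TCP ports.
WELL_KNOWN_PORTS = {21: "ftp", 25: "smtp", 80: "http", 110: "pop3"}


def identify_application_protocol(tcp_header: bytes) -> str:
    """Guess the application-layer protocol from the ports in a TCP header."""
    source_port, destination_port = struct.unpack("!HH", tcp_header[:4])
    # Check the destination first: the server side normally owns the well-known port.
    for port in (destination_port, source_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"


# A 20-byte header with source port 49152 and destination port 25; other fields zeroed.
header = struct.pack("!HH", 49152, 25) + bytes(16)
print(identify_application_protocol(header))  # smtp
```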

2. Trace and parse the application-layer SMTP session, ignoring the command exchange and extracting the data part. Because the application-layer data is large, the network transmitted it in fragments, so the data part has to be reassembled.

3. Decoding. Parsing the application-layer command exchange shows that the application-layer data was encoded with base64. Therefore, before the text content of the packets is saved, the raw information extracted from the packets is automatically base64-decoded, as shown in Fig. 3. The mail content recovered in this embodiment is "flourishing start great master"; the attachment is a document in doc format whose content is "contract".
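The base64 decoding step can be sketched with Python's standard `base64` module. The function name `decode_body` and the UTF-8 charset are assumptions for the example; in practice the charset would be taken from the mail headers, and the sample payload is produced here by encoding a known string rather than taken from Fig. 3.

```python
import base64


def decode_body(encoded: bytes, charset: str = "utf-8") -> str:
    """Decode a base64-encoded message body into text."""
    return base64.b64decode(encoded).decode(charset)


# Build a sample payload by encoding a known string, then decode it again.
sample = base64.b64encode("contract".encode("utf-8"))
print(decode_body(sample))  # contract
```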

Fifth step: segment the text information restored in the fourth step into words and extract its feature vector. Then classify it with the SVM according to the training result of the third step to detect whether it is harmful text.

After classification by the SVM, the document is classified as normal text.

This embodiment describes the classification process based on a one-class feature vector machine with reduced features after typical network traffic has been captured. The feature vectors obtained from the feature reduction algorithms are used as the training vectors of the SVM, yielding the criterion for deciding whether a text is harmful information. The network traffic is then automatically identified, reassembled, decoded and saved, and finally the text is classified, which achieves the detection of information.

The application prospects of the invention fall into two broad categories: first, providing a network information center (NIC) with recovery and storage of data transmitted over multiple protocols; second, providing a network security center with identification and filtering of harmful information.

In addition, unless otherwise stated or indicated, terms such as "first", "second" and "third" in the specification are used only to distinguish the components, elements, steps and so on in the specification, and not to indicate any logical or sequential relationship between them.

It is to be understood that, although the invention has been disclosed above by way of preferred embodiments, those embodiments are not intended to limit the invention. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make many possible changes and modifications to the technical solution, or modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of the technical solution, still falls within the scope of protection of the technical solution of the invention.

Claims

1. A multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features, characterized by comprising: first, identifying the network protocol of a data packet and performing packet reassembly, decoding, text extraction and text restoration; then, segmenting the restored text into words, extracting feature vectors with a feature reduction algorithm, and classifying the text.

2. The multi-protocol network file content inspection method according to claim 1, characterized in that the feature vectors are nouns and verbs.

3. The multi-protocol network file content inspection method according to claim 1 or 2, characterized in that the feature reduction algorithms comprise a document frequency method, an information gain method and a chi-square goodness-of-fit test.

4. The multi-protocol network file content inspection method according to claim 3, characterized in that the document frequency method uses the number of documents of a class in which a feature word appears to represent the relevance of that feature word to the class, and a feature word that appears in more documents of a class is more likely to be retained.

5. The multi-protocol network file content inspection method according to claim 3, characterized in that the information gain method calculates the difference between the amount of information in the system with and without a feature, this difference defining the amount of information the feature brings to the system and serving as the basis for its ability to detect a given class.

6. The multi-protocol network file content inspection method according to claim 3, characterized in that the chi-square goodness-of-fit test observes the deviation between the actual values and the theoretical values to determine whether the hypothesis that the feature has a significant influence on the system is correct.

7. A multi-protocol network file content inspection method for detecting sensitive information in network traffic using a one-class feature vector machine with reduced features, characterized by comprising:

8. The multi-protocol network file content inspection method according to claim 7, characterized in that all nouns and verbs are extracted as candidate feature vectors.

9. The multi-protocol network file content inspection method according to claim 7 or 8, characterized in that, in the second step, the feature reduction algorithms select from the candidate feature vectors, extracting only the feature vectors that have a larger influence on the system, and the feature reduction algorithms comprise a document frequency method, an information gain method and a chi-square goodness-of-fit test, specifically:

(1) Document frequency method

(2) Information gain method

wherein the information entropy can be calculated with the following formula

where c_i is the i-th class and p(c_i) is the frequency with which c_i occurs;

The conditional entropy can be calculated with the following formula

where p(x_i) is the frequency with which x_i appears,

(3) Chi-square goodness-of-fit test

10. The multi-protocol network file content inspection method according to claim 7 or 8, characterized in that, in the fourth step, the protocols used at the network layer and the application layer of the text information are determined from the values of the offset fields in the data packet; the information in the network layer is used to confirm the order of the packets so that the application-layer information can be restored in its original transmission order, and the information in the application layer defines the specific encoding.