CN103617156A - Multi-protocol network file content inspection method - Google Patents

Info

Publication number
CN103617156A
Authority
CN
China
Prior art keywords
feature
information
occurring
algorithm
file content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310567527.8A
Other languages
Chinese (zh)
Inventor
刘功申
丁宵云
苏波
孟魁
宁蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN201310567527.8A
Publication of CN103617156A

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a multi-protocol network file content inspection method for detecting sensitive information in network traffic based on a one-class support vector machine with reduced features. The method includes: identifying the network protocol of a data packet, and subjecting the data packet to reassembly, decoding, text extraction and text restoration; segmenting the restored text into words, and extracting feature vectors with a feature reduction algorithm. The feature reduction algorithm includes a document-frequency-based method, an information gain method and a.

Description

Multi-protocol network file content inspection method
Technical field
The present invention relates to a method in the field of network information technology, specifically to a multi-protocol network file content inspection method, and more specifically to a method for detecting sensitive information in network traffic based on a one-class support vector machine with reduced features.
Background art
The Internet has developed rapidly over recent decades, and networks have become an important part of the information society. As a result, however, information of widely varying quality now floods the Internet. Traditional methods for detecting sensitive information in network traffic can only inspect unencoded, in-order packets, and all of them rely on string matching to do so. As network services are updated day by day, traditional text-based detection of sensitive information can no longer meet current needs. Its shortcomings are mainly the following:
1. Encoded or out-of-order packets cannot be processed
To reduce the size of the transmitted data or to guarantee accurate transmission, many network protocols transmit packets using an encoding agreed upon by the two parties. Traditional detection does not understand the protocol format used by the communicating parties and therefore cannot decode the data correctly. Nor can it reassemble packets that arrive out of order or more than once because of differing network paths, so the original information cannot be recovered.
2. Full-text matching wastes resources
Conventional techniques match the full text of every document entering the system before concluding whether it contains objectionable information. Although researchers have proposed algorithms such as KMP and Boyer-Moore to make the search more efficient and reduce processing time, the worst-case complexity is still O(m*n).
3. Objectionable features must be defined in advance
To detect objectionable text, conventional techniques must define in advance the sensitive information to be filtered, which requires a large database of objectionable information. When new objectionable information appears, the database update often lags behind, so the detection system lacks good real-time performance.
4. Detection of objectionable information is not robust
To evade detection, text is often constructed so that it differs slightly from the entries in the database while remaining recognizable to people, for example by separating sensitive words with spaces or by using deliberately wrong characters. This makes building the database difficult.
Researchers have used classification to tackle this kind of mass data mining and have proposed the one-class support vector machine model, but it has shortcomings in practice. The most prominent is the dimensionality explosion, caused by the very large number of words that text can contain: the fifth edition of the Modern Chinese Dictionary (published by the Commercial Press in May 2005) contains 65,000 entries. Using such a high dimensionality seriously wastes storage and computing resources.
Summary of the invention
The technical problem addressed by the present invention is the above defects in the prior art. The invention provides a new method for detecting sensitive information in network traffic based on a one-class support vector machine with reduced features, which resolves the problems faced by traditional detection methods.
To achieve this technical purpose, the present invention provides a multi-protocol network file content inspection method that detects sensitive information in network traffic based on a one-class support vector machine with reduced features. The method comprises: first identifying the network protocol of the packets and performing packet reassembly, decoding, text extraction and text restoration; then segmenting the restored text into words, extracting feature vectors with a feature reduction algorithm, and classifying the text.
Preferably, the feature vectors are nouns and verbs.
Preferably, the feature reduction algorithms comprise a document-frequency-based method, an information gain method and a chi-square goodness-of-fit test method.
Preferably, the document-frequency-based method uses the number of documents of a class in which a feature word occurs to represent the relevance of that feature word to the class; the more documents of a class a feature word occurs in, the more likely it is to be retained.
Preferably, the information gain method computes the difference between the amount of information in the system with and without a feature, and uses this difference, that is, the information the feature brings to the system, as the basis for judging how useful the feature is for detecting a class.
Preferably, the chi-square goodness-of-fit test method observes the deviation between actual and theoretical values to determine whether the hypothesis that a feature has a significant influence on the system is correct.
The present invention further provides a multi-protocol network file content inspection method that detects sensitive information in network traffic based on a one-class support vector machine with reduced features, comprising:
Step 1: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors;
Step 2: select from the candidate feature vectors using a feature reduction algorithm;
Step 3: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all the vectors in step 2, and thereby obtain the classification criterion;
Step 4: determine the transport protocol of the packets and, following the RFC definitions of the various transport-layer and application-layer protocols, extract and restore the text information;
Step 5: segment the text information restored in step 4 into words and extract its feature vector; then, according to the training result of step 3, classify it with the SVM to detect whether it is objectionable text.
Preferably, all nouns and verbs are extracted as candidate feature vectors.
Preferably, in step 2 the feature reduction algorithm extracts from the candidate feature vectors only those with a larger influence on the system. The feature reduction algorithms comprise a document-frequency-based method, an information gain method and a chi-square goodness-of-fit test method, specifically:
(1) Document-frequency-based method
The algorithm counts the frequency of occurrence of the non-stop words across the whole database, sorts them by frequency, and selects the most frequent ones as feature words for the dimension mapping of the SVM algorithm; the number selected depends on the accuracy the system requires. Each article in the database is first segmented into words, and only nouns and verbs are retained as candidate feature words. Each candidate word that is not in the stop-word list is then counted and recorded in a frequency table. Finally, the candidate words in the frequency table are sorted by count and the top n are selected as the feature words obtained by the DF algorithm, which ends the algorithm;
(2) Information gain method
The information entropy and the conditional entropy are calculated for each candidate word obtained from preprocessing. Once the entropy introduced by each candidate word has been calculated, the words are sorted by this value in descending order and the top n are selected as the feature words obtained by the IG algorithm, which ends the algorithm;
The information entropy can be calculated with the following formula:
$$H(c) = -\sum_{i=1}^{n} p(c_i) \log p(c_i)$$
where c_i is the i-th class, x_i is the i-th feature, and p(x_i) is the frequency with which x_i occurs;
The conditional entropy can be calculated with the following formula:
$$H(c \mid x_j) = -p(x_j)\sum_{i=1}^{n} p(c_i \mid x_j)\log p(c_i \mid x_j) - p(\bar{x}_j)\sum_{i=1}^{n} p(c_i \mid \bar{x}_j)\log p(c_i \mid \bar{x}_j)$$
where p(c_i | x_j) is the probability that class c_i occurs when feature x_j occurs, and p(c_i | \bar{x}_j) is the probability that class c_i occurs when feature x_j does not occur;
The entropy introduced by a candidate word is the difference between the information entropy of the system and the conditional entropy, and can be calculated with the following formula:
$$IG(x_j) = H(c) + p(x_j)\sum_{i=1}^{n} p(c_i \mid x_j)\log p(c_i \mid x_j) + p(\bar{x}_j)\sum_{i=1}^{n} p(c_i \mid \bar{x}_j)\log p(c_i \mid \bar{x}_j)$$
where p(x_j) is the frequency with which x_j occurs, p(\bar{x}_j) is the frequency with which x_j does not occur, H(c) is the information entropy of the two classes, p(c_i | x_j) is the probability that class c_i occurs when feature x_j occurs, and p(c_i | \bar{x}_j) is the probability that class c_i occurs when feature x_j does not occur;
(3) Chi-square goodness-of-fit test method
For each candidate word, calculate the number of records A in the normal class in which it occurs, the number B in the normal class in which it does not occur, the number C in the abnormal class in which it occurs, and the number D in the abnormal class in which it does not occur. The weight is then calculated with the following formula:
$$\chi^2 = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}, \qquad N = A + B + C + D$$
The words are sorted by this value in descending order and the top n are selected as the feature words obtained by the chi-square goodness-of-fit test method, which ends the algorithm.
Preferably, in step 4 the network-layer and application-layer protocols used by the text information are determined from the values of the offset fields in the packet. The network-layer information is used to confirm the order of the packets, so that the application-layer information can be restored in the order in which it was originally sent; the application-layer information defines the specific encoding used.
Brief description of the drawings
A more complete understanding of the present invention and of its attendant advantages and features will be more readily obtained by reference to the following detailed description in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows a flowchart of the multi-protocol network file content inspection method according to an embodiment of the present invention.
Fig. 2 schematically shows an example of a packet used according to an embodiment of the present invention.
Fig. 3 schematically shows a decoding example according to an embodiment of the present invention.
It should be noted that the drawings illustrate the invention and do not limit it. Drawings that represent structures may not be drawn to scale, and identical or similar elements are denoted by identical or similar reference labels.
Detailed description of the embodiments
To make the content of the present invention clearer and easier to understand, it is described in detail below with reference to specific embodiments and the drawings.
The invention provides a multi-protocol network file content inspection method that detects sensitive information in network traffic based on a one-class support vector machine with reduced features. The method first identifies the network protocol of the packets and performs packet reassembly, decoding, text extraction and text restoration; it then segments the restored text into words, extracts feature vectors with a feature reduction algorithm, and classifies the text.
Specific embodiments of the invention are described below.
Fig. 1 schematically shows a flowchart of the multi-protocol network file content inspection method according to an embodiment of the present invention.
As shown in Fig. 1, the multi-protocol network file content inspection method according to an embodiment of the present invention, which detects sensitive information in network traffic based on a one-class support vector machine with reduced features, comprises:
Step S1: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors.
Word segmentation here means splitting a passage of text according to its grammar and then tagging each resulting word with its part of speech, such as noun, verb or adverb.
The feature vectors are nouns and verbs, which are used as vectors in the subsequent SVM (Support Vector Machine) training.
Step S2: apply three feature reduction algorithms to accept or reject candidate feature vectors, keeping only those with a larger influence on the system.
The three feature reduction algorithms are the document frequency (Document Frequency, DF) method, the information gain (Information Gain, IG) method and the chi-square goodness-of-fit test (CHI statistic).
Feature vector extraction means that the texts in the existing database are segmented into words and all resulting nouns and verbs serve as candidate feature vectors. From all the candidates, a subset is selected by some extraction method as the feature vectors used in the subsequent detection of objectionable information.
The document-frequency-based method uses the number of documents of a class in which a feature word occurs to represent the relevance of that feature word to the class. The more documents of a class a feature word occurs in, the more likely it is to be retained. In operation, the algorithm counts the frequency of occurrence of the non-stop words across the whole database, sorts them by frequency, and selects the most frequent ones as feature words for the dimension mapping of the SVM algorithm. The number selected depends on the accuracy the system requires, for example 500 or 1000, and is set according to the computational load the system can bear. In practice, each article in the database is first segmented into words, and only nouns and verbs are retained as candidate feature words. Each candidate word that is not in the stop-word list is then counted and recorded in a frequency table. These steps also serve as the preprocessing for the IG and CHI algorithms and are not repeated in the descriptions below. Finally, the candidate words in the frequency table are sorted by count and the top n are selected as the feature words obtained by the DF algorithm, which ends the algorithm.
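The DF selection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `select_by_document_frequency`, the representation of each document as a list of already segmented candidate words, and the choice to count each word once per document are all assumptions made for the example.

```python
from collections import Counter


def select_by_document_frequency(docs, stop_words, n):
    """Return the n candidate words that occur in the most documents.

    docs: list of documents, each a list of candidate words (assumed to be
          already segmented and filtered to nouns and verbs).
    stop_words: iterable of words to ignore.
    n: number of feature words to keep (e.g. 500 or 1000 in the text).
    """
    stop = set(stop_words)
    frequency_table = Counter()
    for words in docs:
        # Assumption: count a word once per document (document frequency).
        frequency_table.update(set(words) - stop)
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(frequency_table.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

The same frequency table could be reused as the preprocessing output for the IG and CHI rankings, since the text states that they share these steps.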
The information gain method computes the difference between the amount of information in the system with and without a feature, and uses this difference, that is, the information the feature brings to the system, as the basis for judging how useful the feature is for detecting a class. In operation, the information entropy and the conditional entropy are calculated for each candidate word obtained from preprocessing. Note that, because the result is applied to a one-class SVM algorithm, the frequencies of the classes differ greatly, so the number of documents in each class must be taken into account when calculating p(c_i). Once the entropy introduced by each candidate word has been calculated, the words are sorted by this value in descending order and the top n are selected as the feature words obtained by the IG algorithm, which ends the algorithm.
The information entropy can be calculated with the following formula:
$$H(c) = -\sum_{i=1}^{n} p(c_i) \log p(c_i)$$
where c_i is the i-th class, x_i is the i-th feature, and p(x_i) is the frequency with which x_i occurs.
The conditional entropy can be calculated with the following formula:
$$H(c \mid x_j) = -p(x_j)\sum_{i=1}^{n} p(c_i \mid x_j)\log p(c_i \mid x_j) - p(\bar{x}_j)\sum_{i=1}^{n} p(c_i \mid \bar{x}_j)\log p(c_i \mid \bar{x}_j)$$
where p(c_i | x_j) is the probability that class c_i occurs when feature x_j occurs, and p(c_i | \bar{x}_j) is the probability that class c_i occurs when feature x_j does not occur.
The entropy introduced by a candidate word is the difference between the information entropy of the system and the conditional entropy, and can be calculated with the following formula:
$$IG(x_j) = H(c) + p(x_j)\sum_{i=1}^{n} p(c_i \mid x_j)\log p(c_i \mid x_j) + p(\bar{x}_j)\sum_{i=1}^{n} p(c_i \mid \bar{x}_j)\log p(c_i \mid \bar{x}_j)$$
where p(x_j) is the frequency with which x_j occurs, p(\bar{x}_j) is the frequency with which x_j does not occur, H(c) is the information entropy of the two classes, p(c_i | x_j) is the probability that class c_i occurs when feature x_j occurs, and p(c_i | \bar{x}_j) is the probability that class c_i occurs when feature x_j does not occur.
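As a rough illustration of the information gain calculation, the sketch below computes H(c) minus the conditional entropy for a single candidate word from per-class document counts. It assumes the standard weighted form of the conditional entropy and a base-2 logarithm (the text does not state the base); the names `entropy`, `information_gain`, `n_with` and `n_without` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (base 2, an assumption) of a probability list."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(n_with, n_without):
    """Information gain of one candidate word.

    n_with[c]:    number of documents of class c that contain the word.
    n_without[c]: number of documents of class c that do not contain it.
    Class probabilities are estimated from document counts, so unequal
    class sizes are reflected in p(c_i).
    """
    classes = sorted(set(n_with) | set(n_without))
    total_with = sum(n_with.values())
    total_without = sum(n_without.values())
    total = total_with + total_without
    if total == 0:
        return 0.0
    h_c = entropy([(n_with.get(c, 0) + n_without.get(c, 0)) / total
                   for c in classes])
    h_cond = 0.0
    if total_with:
        # p(x) * H(c | x present)
        h_cond += (total_with / total) * entropy(
            [n_with.get(c, 0) / total_with for c in classes])
    if total_without:
        # p(not x) * H(c | x absent)
        h_cond += (total_without / total) * entropy(
            [n_without.get(c, 0) / total_without for c in classes])
    return h_c - h_cond
```

A word that appears only in one class yields the full class entropy as its gain, while a word distributed identically across classes yields zero, which matches the intent of ranking candidates by this value.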
The chi-square goodness-of-fit test method observes the deviation between actual and theoretical values to determine whether the hypothesis that a feature has a significant influence on the system is correct. In operation, for each candidate word, calculate the number of records A in the normal class in which it occurs, the number B in the normal class in which it does not occur, the number C in the abnormal class in which it occurs, and the number D in the abnormal class in which it does not occur. The weight is then calculated with the following formula. The words are sorted by this value in descending order and the top n are selected as the feature words obtained by the CHI algorithm, which ends the algorithm.
$$\chi^2 = \frac{N\,(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}, \qquad N = A + B + C + D$$
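The weight calculation can be sketched with the standard chi-square statistic for a 2x2 contingency table, using the counts A, B, C and D as defined in the text. The exact formula in the original figure is not available in this text, so the standard form is an assumption, as are the function names `chi_square` and `top_n_by_chi`.

```python
def chi_square(a, b, c, d):
    """Standard 2x2 chi-square statistic (assumed form).

    a: normal-class records containing the word
    b: normal-class records not containing the word
    c: abnormal-class records containing the word
    d: abnormal-class records not containing the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A degenerate table carries no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def top_n_by_chi(counts, n):
    """counts: dict mapping word -> (a, b, c, d). Returns the top n words."""
    ranked = sorted(counts, key=lambda w: (-chi_square(*counts[w]), w))
    return ranked[:n]
```

Words whose presence is independent of the class score zero, so they fall to the bottom of the ranking and are dropped when only the top n are kept.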
Step S3: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all the vectors in step S2, and obtain the classification criterion.
Step S4: determine the transport protocol of the packets and, following the RFC (Request For Comments) definitions of the various transport-layer and application-layer protocols, extract and restore the text information.
Following the RFC definitions, the system determines the network-layer and application-layer protocols used by the text information from the values of the offset fields in the packet. The network-layer information can be used to confirm the order of the packets, so that the application-layer information is restored in the order in which it was originally sent. The application-layer information defines the specific encoding used. Once the application-layer protocol of the packet has been identified, its content is recovered according to the RFC definition.
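The reordering step can be pictured with a small sketch that rebuilds a byte stream from segments that arrive out of order or more than once. It assumes TCP-style byte sequence numbers, no sequence-number wraparound and no partially overlapping segments; the function name `reassemble` and the `(sequence_number, payload)` tuple layout are hypothetical.

```python
def reassemble(segments):
    """Rebuild the original byte stream from (sequence_number, payload) pairs.

    Duplicate segments (same sequence number) are dropped, keeping the
    first copy seen. A gap or overlap between consecutive segments raises
    ValueError, since the stream cannot be restored faithfully.
    """
    by_seq = {}
    for seq, payload in segments:
        by_seq.setdefault(seq, payload)  # ignore repeated arrivals

    stream = bytearray()
    expected = None
    for seq in sorted(by_seq):
        if expected is not None and seq != expected:
            raise ValueError(
                f"gap or overlap at sequence {seq}, expected {expected}")
        stream += by_seq[seq]
        expected = seq + len(by_seq[seq])
    return bytes(stream)
```

Raising on a gap rather than silently concatenating is a design choice for the sketch: decoding an application-layer body with missing bytes would produce misleading text for the classifier.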
Step S5: segment the text information restored in step S4 into words and extract its feature vector; then, according to the training result of step S3, classify it with the SVM to detect whether it is objectionable text.
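Before classification, the segmented text has to be mapped onto the dimensions given by the selected feature words. The sketch below shows one simple way to do that. The text does not specify the weighting, so plain occurrence counts are an assumption, and the name `to_feature_vector` is hypothetical.

```python
from collections import Counter


def to_feature_vector(words, feature_words):
    """Map a segmented text onto the selected feature dimensions.

    words: list of words from the restored, segmented text.
    feature_words: ordered list of the n selected feature words.
    Returns a list of occurrence counts, one per feature word; words
    outside the selected features are ignored.
    """
    counts = Counter(words)
    return [counts[word] for word in feature_words]
```

Vectors of this shape could then be passed to a one-class SVM implementation (for example scikit-learn's `OneClassSVM`) for training and prediction; that step is not shown here.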
Advantages of the present invention: it realizes a new method for detecting sensitive information in network traffic based on a one-class support vector machine with reduced features. The method overcomes the shortcomings of traditional collection methods, namely the inability to process encoded or out-of-order packets, the waste of resources in full-text matching, and the need to define objectionable keywords in advance, and it also makes detection more robust. The invention can provide network information centers (NIC) with recovery and storage of data transmitted over multiple protocols, and can provide network security centers with identification and filtering of objectionable information.
An embodiment of the present invention is described in detail below. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and a specific operating procedure are given, but the scope of protection of the invention is not limited to the following embodiment.
In this embodiment the party whose traffic is collected is a typical forum. The embodiment comprises the following steps:
Step 1: take a text database that has been manually labeled, segment it into words, and extract all nouns and verbs as candidate feature vectors.
Word segmentation here means splitting a passage of text according to its grammar and then tagging each resulting word with its part of speech, such as noun, verb or adverb.
For example, take the sentence "Happy birthday to the motherland; National Day is almost here; may the motherland thrive and prosper."
The result after segmentation is "wish/v motherland/n birthday/n happy/an ,/un National Day/t soon/n arrive/v (particle)/v ,/un wish/v motherland/n prosperity/i".
Here v denotes a verb and n a noun; the other words, such as adjectives and adverbs, are discarded and not considered.
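The filtering of a tagged segmentation result can be sketched as below. The sketch assumes the segmenter emits whitespace-separated `word/tag` tokens, as in the example above, and that single-token words contain no spaces; the function name `candidate_words` is hypothetical, and no particular segmentation tool is implied.

```python
def candidate_words(tagged, keep_tags=("n", "v")):
    """Keep only the words whose part-of-speech tag is in keep_tags.

    tagged: string of whitespace-separated "word/tag" tokens.
    keep_tags: tags to retain; by default nouns ("n") and verbs ("v").
    """
    result = []
    for token in tagged.split():
        # Split on the last slash so words containing "/" survive.
        word, separator, tag = token.rpartition("/")
        if separator and word and tag in keep_tags:
            result.append(word)
    return result
```

Exact tag matching means tags such as `an` or `un` are rejected even though they contain the letter n, which is the behaviour the example in the text calls for.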
Step 2: apply three feature reduction algorithms to accept or reject candidate feature vectors, keeping only those with a larger influence on the system.
The three feature reduction algorithms are the document frequency (Document Frequency, DF) method, the information gain (Information Gain, IG) method and the chi-square goodness-of-fit test (CHI statistic).
The document-frequency-based method uses the number of documents of a class in which a feature word occurs to represent the relevance of that feature word to the class.
The information gain method computes the difference between the amount of information in the system with and without a feature, and uses this difference, that is, the information the feature brings to the system, as the basis for judging how useful the feature is for detecting a class.
The chi-square goodness-of-fit test method observes the deviation between actual and theoretical values to determine whether the hypothesis that a feature has a significant influence on the system is correct.
Step 3: train a one-class support vector machine on the manually labeled text database, using the feature vectors selected from all the vectors in step 2, and obtain the classification criterion.
Training on the database of objectionable information yields the top 20 extracted feature vectors, shown in the following table.
[Table images DEST_PATH_GDA0000447144760000091 and DEST_PATH_GDA0000447144760000101: top 20 extracted feature vectors; not reproduced in this text]
Step 4: determine the transport protocol of the packets and, following the RFC definitions of the various transport-layer and application-layer protocols, extract and restore the text information.
In this embodiment, the system reassembles, extracts, decodes and restores the packet shown in Fig. 2.
1. Protocol identification. The transport protocol of the packet is TCP. Parsing the ports used in the TCP exchange shows port 25, so the application-layer protocol is SMTP.
2. The application-layer SMTP exchange is tracked and parsed; its command-interaction part is ignored and its data part is extracted. Because the application-layer data is large, the network transmitted it in fragments, so the data part must be reassembled.
3. Decoding. Parsing the application-layer command interaction shows that the application-layer data was encoded with base64. Therefore, before the text content of the packet is saved, base64 decoding is automatically applied to the raw information extracted from the packet, as shown in Fig. 3. The mail content recovered in this embodiment is "flourishing start great master"; the text attachment is a document in doc format whose content is "contract".
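The identification and decoding steps in this walkthrough can be sketched as follows. The port table contains only a few well-known assignments and is illustrative, not exhaustive; the function names `identify_application_protocol` and `decode_body` and the idea of passing the transfer encoding as a string are assumptions made for the example. Decoding uses the standard-library `base64.b64decode`.

```python
import base64

# A few well-known TCP port assignments (illustrative subset).
WELL_KNOWN_PORTS = {21: "ftp", 25: "smtp", 80: "http", 110: "pop3"}


def identify_application_protocol(src_port, dst_port):
    """Guess the application-layer protocol from the TCP ports.

    Returns the protocol name, or None if neither port is in the table.
    """
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return None


def decode_body(data, transfer_encoding):
    """Decode an application-layer body according to its declared encoding.

    data: bytes extracted from the reassembled data part.
    transfer_encoding: encoding name found in the command interaction or
                       headers (only "base64" is handled in this sketch).
    """
    if transfer_encoding.lower() == "base64":
        return base64.b64decode(data)
    return data  # unknown or identity encoding: return unchanged
```

For instance, a body `b"Y29udHJhY3Q="` declared as base64 decodes to `b"contract"`. Port-based identification is only a first guess; a fuller system would confirm it against the protocol's command syntax.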
Step 5: segment the text information restored in step 4 into words and extract its feature vector. Then, according to the training result of step 3, classify it with the SVM to detect whether it is objectionable text.
After classification by the SVM, the document is classified as normal text.
This embodiment describes the classification process, based on a one-class support vector machine with reduced features, that follows the capture of typical network traffic. The feature vectors obtained from the feature reduction algorithms are used as the training vectors of the SVM, yielding the criterion for judging whether a text is objectionable. The network traffic is then automatically identified, reassembled, decoded and saved, and finally the text is classified, which achieves the goal of detecting the information.
The application prospects of the present invention fall into two broad categories: first, providing network information centers (NIC) with recovery and storage of data transmitted over multiple protocols; second, providing network security centers with identification and filtering of objectionable information.
In addition, it should be noted that, unless otherwise stated or indicated, terms such as "first", "second" and "third" in the specification are used only to distinguish the components, elements and steps of the specification, and not to express any logical or ordinal relationship between them.
It will be understood that, although the present invention has been disclosed above by way of preferred embodiments, these embodiments are not intended to limit the invention. Any person of ordinary skill in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make many possible changes and modifications to the technical solution, or to modify it into equivalent embodiments. Therefore, any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of protection of the technical solution of the invention.

Claims (10)

1. a multiprotocol network file content inspection method, for the category feature vector machine based on simplifying feature, carry out the sensitive information of Sampling network flow, it is characterized in that comprising: the procotol of identification data bag first, carry out packet restructuring, decoding, text extracts and restore; Then, for the text restoring, carry out participle, the brief algorithm of use characteristic extracts proper vector, and classifies.
2. multiprotocol network file content inspection method according to claim 1, is characterized in that, proper vector is some nouns and verb.
3. multiprotocol network file content inspection method according to claim 1 and 2, is characterized in that, the brief algorithm of feature comprises respectively based on document frequency method, Information Gain Method, evolution and fits the method for inspection.
4. multiprotocol network file content inspection method according to claim 3, it is characterized in that, the number of documents occurring in a classification based on document frequency method use characteristic word represents this Feature Words and such other degree of correlation, and the possibility that the Feature Words occurring in the more documents in certain classification is retained is larger.
5. multiprotocol network file content inspection method according to claim 3, it is characterized in that, the difference that Information Gain Method is introduced this feature by computing system and do not introduced the front and back quantity of information of this feature defines the quantity of information that this feature brings to system and is used as it to detecting the foundation of certain classification.
6. multiprotocol network file content inspection method according to claim 3, is characterized in that, evolution fits the method for inspection and determines and suppose that whether the supposition that this feature has a significant impact system is correct by observing actual value and the deviation of theoretical value.
7. a multiprotocol network file content inspection method, carrys out the sensitive information of Sampling network flow for the category feature vector machine based on simplifying feature, it is characterized in that comprising:
The first step, is used the text database that has completed artificial mark, and it is carried out to participle, extracts all nouns and verb as candidate feature vector;
Second step, the brief algorithm of use characteristic extracts candidate feature vector;
The 3rd step, is used a class support vector machines to train completing the text database of artificial mark, wherein uses the proper vector extracting from institute's directed quantity in second step, obtains thus the standard of classification;
The 4th step, the host-host protocol of specified data bag, and for the definition of different transport layers and application layer protocol, extract, restore text message according to RFC;
The 5th step, the text message for the recovery in the 4th step, carries out participle, and proper vector is extracted; Then whether according to the training result in the 3rd step, use SVM to classify, detecting it is bad text.
8. The multi-protocol network file content inspection method according to claim 7, characterized in that all nouns and verbs are extracted as candidate feature vectors.
9. The multi-protocol network file content inspection method according to claim 7 or 8, characterized in that, in the second step, the feature simplification algorithm extracts from the candidate feature vectors only those feature vectors that have a larger influence on the system; the feature simplification algorithm comprises the document-frequency-based method, the information gain method and the chi-square goodness-of-fit test method, specifically:
(1) Document-frequency-based method
The algorithm counts the occurrence frequency of the non-stop words in the whole database, sorts them by frequency, and selects the most frequent ones as feature words for dimension mapping in the SVM algorithm; the number selected depends on the accuracy the system requires. Each article in the database is first segmented into words, and only nouns and verbs are retained as candidate feature words; each candidate word that is not in the stop-word list is then counted and recorded in a frequency table; finally, the candidate words in the frequency table are sorted by occurrence count and the top n are selected as the feature words obtained by the DF algorithm, at which point the algorithm ends;
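A minimal sketch of the document-frequency selection described above, assuming the documents are already segmented and restricted to nouns and verbs. The function name, the sample data, and the choice to count each word once per document (following the wording of claim 4) are assumptions rather than details given in the claim.

```python
from collections import Counter

def select_by_document_frequency(documents, stop_words, n):
    """Return the n candidate words that occur in the most documents.

    documents: list of token lists, already limited to nouns and verbs.
    stop_words: set of words to ignore.
    """
    frequency_table = Counter()
    for tokens in documents:
        # Count each candidate word once per document, skipping stop words.
        frequency_table.update(set(tokens) - stop_words)
    # Sort by descending count; ties are broken alphabetically for determinism.
    ranked = sorted(frequency_table.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]

# Hypothetical segmented documents.
documents = [["virus", "send"], ["virus", "file"], ["send", "virus", "the"]]
top_words = select_by_document_frequency(documents, stop_words={"the"}, n=2)
```

The value of `n` is left to the caller, matching the statement that the number of feature words depends on the accuracy the system requires.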
(2) Information gain method
For each candidate word obtained from preprocessing, the information entropy and the conditional entropy are calculated; once the entropy introduced by each candidate word has been calculated, the candidate words are sorted by this value in descending order and the top n are selected as the feature words obtained by the IG algorithm, at which point the algorithm ends;
The information entropy can be calculated with the following formula:
Figure RE-FDA0000447144750000021
where $c_i$ is the i-th feature and $p(c_i)$ denotes the frequency with which $c_i$ occurs;
The conditional entropy can be calculated with the following formula:
Figure RE-FDA0000447144750000031
where $p(c_i \mid x_j)$ denotes the probability that class $c_i$ occurs when feature $x_j$ appears, and $p(c_i \mid \bar{x}_j)$ denotes the probability that class $c_i$ occurs when feature $x_j$ does not appear;
The entropy introduced by a candidate word is the difference between the information entropy of the system and the conditional entropy, $H(c) - H(c \mid x_i)$, where $p(x_i)$ denotes the frequency with which $x_i$ occurs, $p(\bar{x}_i)$ denotes the frequency with which $x_i$ does not occur, $H(c)$ denotes the information entropy of the two classes, $p(c_i \mid x_i)$ denotes the probability that class $c_i$ occurs when feature $x_i$ appears, and $p(c_i \mid \bar{x}_i)$ denotes the probability that class $c_i$ occurs when feature $x_i$ does not appear;
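The information gain of a candidate word can be sketched as follows. Because the formula images are not reproduced in this text, the code uses the standard textbook definitions, which is an assumption: entropy as the negative sum of p·log2(p) over the classes, and conditional entropy as the class entropy of the documents containing the word and of those not containing it, each weighted by its share of the documents. The function names and sample data are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def information_gain(labeled_docs, word):
    """labeled_docs: list of (set_of_tokens, class_label) pairs."""
    total = len(labeled_docs)
    classes = sorted({label for _, label in labeled_docs})

    def class_distribution(docs):
        return [sum(1 for _, label in docs if label == c) / len(docs) for c in classes]

    present = [doc for doc in labeled_docs if word in doc[0]]
    absent = [doc for doc in labeled_docs if word not in doc[0]]

    # Start from the entropy of the system, then subtract the conditional
    # entropy, weighting each subset by the frequency of presence or absence.
    gain = entropy(class_distribution(labeled_docs))
    for subset in (present, absent):
        if subset:
            gain -= len(subset) / total * entropy(class_distribution(subset))
    return gain

# Hypothetical two-class corpus.
labeled_docs = [
    ({"virus"}, "bad"), ({"virus"}, "bad"),
    ({"hello"}, "normal"), ({"hello"}, "normal"),
]
gain_virus = information_gain(labeled_docs, "virus")
gain_unseen = information_gain(labeled_docs, "unseen")
```

In this toy corpus "virus" separates the two classes perfectly, so it receives the full one bit of gain, while a word that never appears receives none. Ranking the candidates by this value and keeping the top n gives the IG selection.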
(3) Chi-square goodness-of-fit test method
For each candidate word, calculate the number of records A in the normal class in which it appears, the number of records B in the normal class in which it does not appear, the number of records C in the abnormal class in which it appears, and the number of records D in the abnormal class in which it does not appear; finally, calculate the weight with the following formula:
Figure RE-FDA0000447144750000036
The candidate words are sorted by this value in descending order and the top n are selected as the feature words obtained by the chi-square goodness-of-fit test method, at which point the algorithm ends.
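The weight computed from A, B, C and D can be sketched with the standard chi-square statistic for a 2×2 contingency table. The formula image is not reproduced here, so the exact form used by the claim is an assumption; the function name and the zero-denominator handling are also illustrative choices.

```python
def chi_square_weight(a, b, c, d):
    """Chi-square statistic for one candidate word.

    a: normal-class records containing the word
    b: normal-class records not containing the word
    c: abnormal-class records containing the word
    d: abnormal-class records not containing the word
    """
    n = a + b + c + d
    # Product of the row totals (word present, word absent) and the
    # column totals (normal class, abnormal class).
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return n * (a * d - b * c) ** 2 / denominator

# A word found only in abnormal records versus a word spread evenly.
strong = chi_square_weight(a=0, b=10, c=10, d=0)
neutral = chi_square_weight(a=5, b=5, c=5, d=5)
```

A larger value means the observed counts deviate more from what independence between the word and the class would predict, which is why the candidates are ranked in descending order.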
10. The multi-protocol network file content inspection method according to claim 7 or 8, characterized in that, in the fourth step, the network-layer and application-layer protocols used by the text information are determined from the values of the offset fields in the data packet; the network-layer information is used to confirm the order of the data packets, so that the application-layer information can be restored in the correct order of its original transmission, and the application-layer information specifically defines the encoding scheme.
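The ordering and decoding described in this claim can be sketched as below. This is a simplified illustration, not the claimed procedure: it assumes each captured packet has already been parsed into a hypothetical `(offset, payload)` pair and that the encoding is known from the application-layer headers. It does not handle gaps or partially overlapping segments.

```python
def restore_text(segments, encoding="utf-8"):
    """Rebuild application-layer text from out-of-order segments.

    segments: iterable of (offset, payload_bytes) pairs, possibly
    out of order or duplicated by retransmission.
    """
    ordered = {}
    for offset, payload in segments:
        # Keep the first copy of a segment seen at a given offset.
        ordered.setdefault(offset, payload)
    data = b"".join(ordered[offset] for offset in sorted(ordered))
    # Undecodable bytes are replaced rather than raising an error.
    return data.decode(encoding, errors="replace")

# Hypothetical capture: the second segment arrives first and is retransmitted.
restored = restore_text([(6, b"world"), (0, b"hello "), (6, b"world")])
```

The restored string is what would then be passed to word segmentation in step 5.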
CN201310567527.8A 2013-11-14 2013-11-14 Multi-protocol network file content inspection method Pending CN103617156A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310567527.8A CN103617156A (en) 2013-11-14 2013-11-14 Multi-protocol network file content inspection method

Publications (1)

Publication Number Publication Date
CN103617156A (en) 2014-03-05

Family

ID=50167859

Country Status (1)

Country Link
CN (1) CN103617156A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003099445A (en) * 2001-09-21 2003-04-04 Telecommunication Advancement Organization Of Japan Sorting key word generation method and program, and recording medium with the program recorded thereon
CN101729542A (en) * 2009-11-26 2010-06-09 上海大学 Multi-protocol information resolving system based on network packet
CN102609714A (en) * 2011-12-31 2012-07-25 哈尔滨理工大学 Novel classifier based on information gain and online support vector machine, and classification method thereof

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
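杨凯峰: "Feature selection method based on document frequency" (基于文档频率的特征选择方法), 《计算机工程》 (Computer Engineering) *
高加旺: "Research on a spam filtering model based on support vector machines" (基于支持向量机的垃圾邮件过滤模型研究), 《中国优秀硕士学位论文全文数据库(信息科技辑)》 (China Master's Theses Full-text Database, Information Science and Technology) *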

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106817248A (en) * 2016-12-19 2017-06-09 西安电子科技大学 A kind of APT attack detection methods
CN106817248B (en) * 2016-12-19 2020-10-16 西安电子科技大学 APT attack detection method
CN109743311A (en) * 2018-12-28 2019-05-10 北京神州绿盟信息安全科技股份有限公司 A kind of WebShell detection method, device and storage medium
CN109743311B (en) * 2018-12-28 2021-10-22 绿盟科技集团股份有限公司 WebShell detection method, device and storage medium
CN109656141A (en) * 2019-01-11 2019-04-19 武汉天喻聚联网络有限公司 Violation identification and machine behaviour control method, equipment, storage medium based on artificial intelligence technology

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140305
